Harys Dalvi

Someone using AI won't take your job. AI will

2025-05-21T16:38:07Z

AI is getting better every day. Rather than let it replace us, the common advice is to take advantage of this improvement by learning to incorporate AI into your work. You may have heard this mantra:

“AI won't take your job. Someone using AI will.”

Already, AI can boost productivity in coding by about 40%, depending on the task. Instead of spending the extra hours doing nothing, developers who use AI will write more code and make more money. A win-win for everyone, except the developer who doesn't use AI and gets left behind.

The same is true in many other domains. AI-human double reading teams can detect breast cancer more accurately than all-human teams, which could free up doctors to diagnose more patients and reduce wait times. OpenAI's o3-mini-high helped physicist Weiguo Yin solve an extension of the Ising model, pushing forward research in materials and superconductors.

It's natural to project this trend continuing. But today I am going to make the opposite claim. Someone using AI won't take your job. AI will.

Learning to use AI is not enough. Unlike previous technologies, AI agents will act as competitors in the job market, not just tools. In order to survive, we must outcompete AI and do what it can't. If we only use it as a tool and outsource our thinking, our human contribution becomes unnecessary, and agents will take over the job without us.

I understand that this is a controversial claim and an uncomfortable one to think about. Don't worry, I'm not a total doomer, and I have some strategies for how to really upskill in the AI age if what I'm saying is true. But first, I want to show you why I think the popular narrative on the subject doesn't hold up, and might even be actively harming you.

AI Agents Are Coming For Your Job

As the internet became widespread, many industries faced decline, like traditional news publishing and brick-and-mortar businesses. But as people learned how to use the internet, new industries came up, and old ones were able to thrive.

Newspapers like The New York Times went online, selling more than ever before. Companies like Amazon and eBay sprang up in e-commerce, while small businesses created websites to attract new customers to their physical locations. Using the internet as a tool, people could share content around the world on platforms like YouTube and WordPress.

AI agents are not just a tool.

AI agents perform entire workflows end-to-end with minimal human involvement. Sure, they are imperfect, and sometimes need input as they go along. They are also not widely available. But these flaws will be ironed out, and AI agents will act more like competitors or coworkers than workplace tools.

If code becomes cheaper to produce, won't companies just have a lot more people writing code? Yes, if code becomes cheaper, companies will want more of it. But it will be agents writing the cheap code, not humans. The same applies to other domains.

Unlike human workers, AI agents can copy themselves endlessly and never get tired. We're seeing this in coding with tools like Replit that build entire apps from prompts in minutes. So if an AI agent is better and cheaper than a human, that human is in trouble, even if they know how to use AI.

Knowing how to use AI is not a moat. Everyone has access to the same tools, and they can ask the tools how to use them better, or simply let them operate autonomously as agents.

Right now, AI lacks agency and struggles with inferring requirements. This makes prompt engineering an important skill for using AI. But as agents improve, they will automate the process of prompt engineering, and smarter models will make prompt engineering less necessary overall.

Or you can just ask ChatGPT to teach you. Start with “You are a professional prompt engineer making $1M/year. Teach me prompt engineering.” Then give it your crappy prompt and ask it to fix it.

AI tools will give everyone superpowers. But when everyone's super, no one will be.

“And when everyone's super... no one will be.”
—Syndrome, The Incredibles (Pixar, 2004)

Staying Relevant in the Face of AI

If AI agents are coming for our jobs, and learning to use AI isn't enough, how do we actually stay relevant in the face of AI?

Image generated by the author with ChatGPT. Prompt by author (condensed): “Use your creativity and agency to design an image for my post. [...] create a thoughtful and insightful image that conveys a clear message not only in the big picture, but also in the specific details.”

I'm not saying don't use AI tools. Use them. They will cut through the slog, saving you hours for creative thinking and decision making. And they can serve as sounding boards and debate partners, amplifying your output even more.

I used ChatGPT heavily as a debate partner to iron out the ideas in this article. But every word is mine. You can tell because ChatGPT would never add in relevant Pixar quotes and link the videos.

In fact, I'd even go so far as to say you can let AI automate 90% of your work and still come out happier, more productive, and more successful.

But if you let AI automate 90% of your work, that last 10% better be killer. Anyone can use AI and get a solid result, so the mastery you put into this part is your edge. Refuse to automate away the core of your work.

And in order to master this 10%, you must master your craft. So study and learn almost as if AI didn't exist, except study even harder, and use AI to help you learn faster. As AI agents roll out, you need to be better than the AI at what you do. You can interpret “better” broadly: you might be better technically, emotionally, or creatively. But you must be better.

This might sound like a losing game, but it's one you have to win if you want to keep playing. And don't underestimate the AI. The latest models are extremely capable, and they're still the worst they'll ever be.

Stealing Apple's catchphrase, succeeding in the AI era isn't only about learning to use AI: it's also about becoming insanely great at your job.

AI agents can even use other AI as a tool, so we have to be better at that part too. This means that as more people use AI, staying relevant is ultimately about directly competing with the AI and being much better than it at some key skill. Sometimes that means better at guiding and supervising AI, but often it means better at writing, coding, research, or anything else.

Look at the Sistine Chapel, The Lion King, even Microsoft Excel. That's what you call making something with soul. That soul is too large and awkwardly shaped to fit into a prompt, and no amount of prompt engineering will change that. You have to get into the weeds of the paint, the words, the code, or whatever your chosen medium is.

Maybe there's a limited market for this kind of soul. Companies like Duolingo are already replacing human contractors with AI, and it might become harder for humans to adapt and compete, to the point where there will simply be fewer people employed. And not everyone uses labor as a source of meaning. But for those who do, this is the best way to produce work that far outclasses what an infinitely replicable AI agent can do.

The Sistine Chapel, where every fresco was designed to convey a particular story or idea, and even the tiny details are full of symbolism and intent. This is the kind of thing AI can't yet do. Image source: Antoine Taveneaux / Wikimedia, CC BY-SA 3.0.

Here's how to stay afloat in the AI age:

Learn to use AI tools. Use them to generate and test ideas, learn new things, and focus on the most exciting and creative parts of your work.
You may find that AI agents can do much of your work for you. Resist the temptation. Use AI, but keep the core of your work so it can be exceptional.
Become extremely competent. AI trains on massive datasets full of mediocrity. It won't replace all workers, but it will replace the mediocre ones.
Be irreplaceable. AI is trained on the sum of human output on the internet. If another human can do your job, it will soon be a fair bet that AI can too.

Some people say AI will never be able to write scalable code, or craft emotionally compelling stories, or whatever else. Really, we just don't know: AI keeps surprising us, and it's possible future systems will be able to do all these things and more, or not.

But if there's one thing AI can never do, at least in the current paradigm, it's care. AI doesn't really care about the work it's doing or anything else. Obsession always shows through in the final product, especially in the hundreds of tiny decisions that bring life to your work.

With the AI tools of the future, anyone can easily become a writer, coder, or even prompt engineer. The barrier to entry falls to zero. But in a flood of adequate results that even autonomous agents can produce, only a few will stand out.

“Your only limit is your soul. What I say is true: Anyone can cook. But only the fearless can be great.”
—Chef Gusteau, Ratatouille (Pixar, 2007)

Getting paid for our labor keeps us fed. But it's also a source of power and leverage, and an opportunity to radiate passion and meaning into the world. I'll be writing and coding till I die, even if the AI overlords lock me in a room with a laptop. I don't just want to manage a team of agents, I also want to craft things I care about. If you want to keep doing what you do for leverage and not just as a hobby, now is the time to upskill fast and stay ahead of the AI — because someone using AI won't take your job, AI will.


LLMs Do Not Predict the Next Word

2025-03-31T16:38:07Z

Back in the 17th century, Newton discovered an equation for gravity. Remarkably, this single equation was super simple (high school algebra at most) and yet it could predict both the elliptical motion of planets and their moons and apples falling here on Earth.

Despite the power and simplicity of Newton's equation, there were a few small issues with it. Most famously, Mercury's orbit didn't match predictions. Einstein solved these when he came up with general relativity, which also predicted black holes and gravitational waves.

Now, you've probably heard a phrase something like “LLMs are just statistical models that predict the next word.” Like Newton's gravity, this is a very good approximation of the truth. But there are some deeper layers that are worth looking at.

Today, I want to investigate LLMs from a reinforcement learning lens, treating them as thinking agents rather than advanced autocomplete models. I'll be touching on fundamental ideas like instruction finetuning and reinforcement learning from human feedback. My goal is not a deep dive into these ideas, but rather to use them to explain to what extent LLMs are really doing something beyond predicting the next word, and taking “actions” of their own “will”.

Next I'll compare and contrast this idea with the buzz around AI agents, and finally I'll speculate on what this means for the future of AI agent development.

Why LLMs Predict the Next Word

Before we get into my hot take, I want to briefly talk about the idea that LLMs predict the next word, and why it's mostly true.

LLMs are pretrained with something called the next-token objective. This part really is as simple as just predicting the next token. So if you have a sentence “I do not like green eggs and ham” in your training data, the training examples might be something like:

(I, do), (I do, not), (I do not, like), ..., (I do not like green eggs and, ham)

That is, at each token in the sentence, the model is given everything up to that token and has to predict the next one. In pretraining, the model is given a huge amount of text like that and has to predict what comes next.[1]
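To make the pairing concrete, here is a minimal sketch of how those training examples could be generated. It assumes whitespace splitting in place of a real tokenizer, and the function name `next_token_examples` is made up for illustration.

```python
def next_token_examples(tokens):
    """Return (context, next_token) pairs for every position after the first."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Whitespace splitting stands in for a real tokenizer (an assumption for illustration).
sentence = "I do not like green eggs and ham"
examples = next_token_examples(sentence.split())

for context, target in examples:
    print(" ".join(context), "->", target)
```

Real tokenizers split text into subword pieces rather than whole words, but the pairing of "everything so far" with "what comes next" works the same way.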

Mathematically, the model's outputs are judged according to cross-entropy loss, which measures the difference between the model's output probabilities and the real next token. One possible formula in the case of language modeling is $L = - \log p_{y}$ where $p_{y}$ is the probability given by the language model for the correct next token.[2] (This formula is greatly simplified from the general cross-entropy formula, but is still valid in the special case of language modeling.) So if the model gives a probability 1 to the actual next token, the loss is zero. Lower probabilities (if the model thinks some other token is likely to come next) lead to higher losses.

Cross-entropy loss function

So in pretraining, by trying to minimize the loss, we maximize the probability of correctly predicting the next token. This is why LLMs are so good at predicting the next token. In fact, they are even better than humans at this.[3]
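As a rough illustration of the simplified formula $L = - \log p_{y}$, the sketch below computes the loss for a single prediction. The probability tables are invented toy values, not outputs of any real model.

```python
import math


def cross_entropy(probs, correct_token):
    """Loss for one prediction: -log of the probability given to the correct token."""
    return -math.log(probs[correct_token])


# Hypothetical model outputs for the token after "green eggs and".
confident = {"ham": 0.9, "spam": 0.1}
unsure = {"ham": 0.1, "spam": 0.9}

print(cross_entropy(confident, "ham"))  # small loss: the model favored the right token
print(cross_entropy(unsure, "ham"))     # larger loss: the model favored the wrong token
```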

Instruction Finetuning

But this method alone — training an LLM to predict the next word on a large string of text — is not enough to make a chatbot. For example, if you asked GPT-3 to “Write an article about American football”, instead of writing an article, it might continue the sentence by predicting the most likely next tokens: “Write an article about American football and its influence on television in America.”

This is where instruction finetuning comes in, also known colloquially as instruction tuning.[1] This allows better performance on zero-shot learning, meaning you can get the model to perform a task just by telling it to do it, without the need to include examples of the task.

The way instruction tuning is actually done is by training on a new dataset of instructions separate from the much larger dataset used in pretraining. FLAN, an early example of instruction tuning, trained on about 250 million tokens during finetuning. In contrast, the pretraining that FLAN built on used 2.49 trillion tokens.[4]

The exact format of instruction tuning depends on the model. An example from Llama 3 is:[5]

<|start_header|>user<|end_header|>
Hi! I am a human.<|eot|>

<|start_header|>assistant<|end_header|>
Hello there! Nice to meet you! I'm Meta AI, your friendly AI assistant<|eot|>

Unlike in pretraining, the model is usually trained only on the completion (labeled assistant in this example), not the rest of the prompt. But other than that, instruction tuning is essentially the same as pretraining, just with a new dataset specialized for prompting.[1] The loss function is the same, meaning the model is still being trained to predict the next token.
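One way to picture "trained only on the completion" is a loss that skips prompt tokens. The sketch below assumes we already know the probability the model gave to each actual token and which role each token belongs to. The names and numbers are hypothetical.

```python
import math


def masked_loss(token_probs, roles):
    """Average -log p over assistant tokens only; prompt tokens are skipped."""
    losses = [
        -math.log(p)
        for p, role in zip(token_probs, roles)
        if role == "assistant"
    ]
    return sum(losses) / len(losses)


# Probability the model assigned to each actual token (toy values).
token_probs = [0.2, 0.3, 0.5, 0.5]
roles = ["user", "user", "assistant", "assistant"]

print(masked_loss(token_probs, roles))
```

The per-token loss is still the same $- \log p_{y}$ as in pretraining. Only the set of tokens that count toward it changes.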

Reinforcement Learning

So far, the model is just predicting the next token. First it learned to do so on a big dataset (pretraining), then it was fine-tuned on a more specific dataset designed for prompting (instruction finetuning). So is the claim “LLMs just predict the next token” true?

Even up to this point, you could make the argument that something deeper is happening. There's a lot of evidence that in order to predict the next word, models have to store detailed information about the world and its facts in their weights. Predicting the next word might be their loss function, but they have developed a rich internal world in the process. It's like saying humans are just gene-copying machines, ignoring all the complexity that humans have developed as a byproduct of evolution.[6]

But all that is still assuming that the objective an LLM is trained on really is next-token prediction. In fact, after pretraining and instruction finetuning, LLMs are trained on an objective that is fundamentally different from next token prediction. This is where reinforcement learning (RL) comes in.

To be more precise, LLMs are trained with something called reinforcement learning from human feedback (RLHF). It's debatable whether this is really RL (Andrej Karpathy says just barely, Yann LeCun says no), but the point is that it's a very different objective from next-token prediction. This has important implications for what the model is really doing.

At a high level, there are two steps to RLHF:[7]

1. We let the model produce many different outputs for various prompts. For each prompt, we ask humans to rank the outputs. (That's the human feedback in RLHF.) This is used to train a reward model that predicts which output humans will prefer.
2. We use this reward model to train the LLM to produce outputs that humans will like. (That's the reinforcement learning in RLHF.)

I won't go into all the details here (both the original paper and the RLHF book are great resources), but today we are focused on the question: do LLMs really just predict the next word? For this, what matters is the loss function of each step.

Reward Modeling

The first step is reward modeling, where we train a separate model to predict the reward of an output: essentially, how good humans think the output is. The loss function for the reward model is $L (θ) = - \log (σ (r_{θ} (x, y_{w}) - r_{θ} (x, y_{l})))$ . This is simplified from the original formula,[7] but captures the core idea. In this formula, our input is a prompt $x$ along with a pair of outputs $y_{w}$ and $y_{l}$ , where $y_{w}$ is the output that the human labeler prefers and $y_{l}$ is the one they don't like as much. The function $r_{θ}$ is the reward model, which takes a prompt and an output and returns a score. $σ$ is the sigmoid function.

If we graph the loss as a function of $r_{θ} (x, y_{w}) - r_{θ} (x, y_{l})$ (how much higher the given reward is for the winning output), we get a curve that looks like this:

Reward model loss function

We can see that as we assign higher scores to the output that humans thought was better, the loss goes down, close to zero as we assign much higher scores. On the other hand, if we assign lower scores to the output that humans actually thought was better, the loss goes up.
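The shape of that curve can be checked numerically. This is a minimal sketch of the simplified loss above, where the two arguments stand in for $r_{θ} (x, y_{w})$ and $r_{θ} (x, y_{l})$. The scores are made-up numbers rather than outputs of a trained reward model.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def reward_model_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(r_w - r_l)) for one human-labeled pair of outputs."""
    return -math.log(sigmoid(reward_preferred - reward_rejected))


print(reward_model_loss(5.0, 0.0))  # preferred output scored much higher: loss near zero
print(reward_model_loss(0.0, 0.0))  # no preference expressed: loss is log(2)
print(reward_model_loss(0.0, 5.0))  # preferred output scored lower: large loss
```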

By training the reward model on labels provided by humans in this way, we end up with a model that can predict how much humans will like an output.

Proximal Policy Optimization

So now we have this function $r_{θ}$ for predicting rewards. That's cool, but what we actually wanted was to train the model. Now we can use the reward model in our loss function for training the actual LLM! At the same time, our LLM is already pretty good with instruction finetuning, so we'll try not to change it too much. The algorithm used for this training is proximal policy optimization (PPO).

The new objective function for the LLM is[7]

$$\begin{aligned} \text{objective} (ϕ) = {} & E_{(x, y) \sim D_{π_{ϕ}^{RL}}} [r_{θ} (x, y) - β \log (π_{ϕ}^{RL} (y ∣ x) / π^{SFT} (y ∣ x))] \\ & + γ E_{x \sim D_{pretrain}} [\log (π_{ϕ}^{RL} (x))] \end{aligned}$$

This is a lot more complicated than the previous loss functions! I'll try to break it down step by step.

First of all, this is technically an objective function, not a loss function. So we are trying to maximize this function, not minimize it.

Let's start with the first term:

$$E_{(x, y) \sim D_{π_{ϕ}^{RL}}} [r_{θ} (x, y)]$$

Here, $(x, y) \sim D_{π_{ϕ}^{RL}}$ means that $x$ is a prompt we are using for RL training, and $y$ is the output that the model produced for that prompt. So far, our objective function is just the expected value of the reward $r_{θ} (x, y)$ over these prompts and outputs. So we are trying to maximize the reward predicted by the reward model that we trained earlier. Note that this is computed on an entire prompt and output, not the next token.

Onto the next term:

$$E_{(x, y) \sim D_{π_{ϕ}^{RL}}} [- β \log (π_{ϕ}^{RL} (y ∣ x) / π^{SFT} (y ∣ x))]$$

Again, $x$ is the prompt and $y$ is the model output. Now $π_{ϕ}^{RL} (y ∣ x)$ is the probability that the current model we're training assigns to the output, while $π^{SFT} (y ∣ x)$ is the probability assigned by the base model we started from, after pretraining and instruction finetuning.

If we move around some notation, letting $p = π_{ϕ}^{RL} (y ∣ x)$ and $q = π^{SFT} (y ∣ x)$ , this is also

$$- β E_{(x, y) \sim D_{π_{ϕ}^{RL}}} [\log (\frac{p}{q})]$$

so we are taking the expected value of $\log (p / q)$ , and using that times $β$ as a penalty. This expected value is the Kullback-Leibler divergence (or KL divergence) between the two distributions $p$ and $q$ , which represents how different the two distributions are. By applying a penalty on the difference, we are making sure that as we train the model, its output probabilities stay somewhat close to what they were in the base model, when all we had done was pretraining and instruction finetuning.
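For a feel of how the KL divergence behaves, here is a small sketch over toy discrete distributions. The two-outcome distributions are invented for illustration. A real LLM's distribution is over entire outputs and would be estimated from samples rather than summed exactly.

```python
import math


def kl_divergence(p, q):
    """KL(p || q): the expected value of log(p / q) when outcomes are drawn from p."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0)


base = {"yes": 0.5, "no": 0.5}    # hypothetical base model distribution (q)
same = {"yes": 0.5, "no": 0.5}    # trained model that has not moved
tuned = {"yes": 0.9, "no": 0.1}   # trained model that has drifted

print(kl_divergence(same, base))   # zero: no difference, so no penalty
print(kl_divergence(tuned, base))  # positive: the further the drift, the bigger the penalty
```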

Finally the last term:

$$γ E_{x \sim D_{pretrain}} [\log (π_{ϕ}^{RL} (x))]$$

This time, instead of using our RLHF dataset of prompts and outputs, we are going back to the pretraining dataset $D_{pretrain}$ . In fact, we are just predicting the next token here: maximizing $\log (π_{ϕ}^{RL} (x))$ is the same as minimizing the loss function from before, $- \log p_{y}$ , only times a constant $γ$ . We add this term so that as we do RLHF, we maintain good performance on predicting the next token on the pretraining data.

Let's recap the meaning of the three terms of this objective function:

1. We try to maximize the reward given by the reward model we trained before. Hopefully, this means we are making outputs that humans will like.
2. We add a penalty for outputting a distribution that goes too far from the base model.
3. We mix in some normal next-token prediction on the pretraining data.
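Putting the three terms together, the sketch below estimates the objective from a handful of samples by replacing each expected value with an average. Everything here is a toy stand-in: the rewards, probabilities, and the values of `beta` and `gamma` are hypothetical, and a real implementation would compute them from the models and update the weights by gradient ascent.

```python
import math


def rlhf_objective(samples, pretrain_probs, beta, gamma):
    """
    samples: (reward, p_rl, p_sft) for each sampled prompt and output pair, where
             p_rl and p_sft are the probabilities the current and base models
             give to that output.
    pretrain_probs: probabilities the current model gives to pretraining text.
    """
    # Terms 1 and 2: reward minus the penalty for drifting from the base model.
    rl_term = sum(
        reward - beta * math.log(p_rl / p_sft)
        for reward, p_rl, p_sft in samples
    ) / len(samples)
    # Term 3: ordinary next-token log-likelihood on pretraining data.
    pretrain_term = sum(math.log(p) for p in pretrain_probs) / len(pretrain_probs)
    return rl_term + gamma * pretrain_term


unchanged = rlhf_objective([(1.0, 0.5, 0.5)], [1.0], beta=0.1, gamma=0.5)
drifted = rlhf_objective([(1.0, 0.8, 0.4)], [1.0], beta=0.1, gamma=0.5)
print(unchanged, drifted)  # same reward, but the drifted model pays a penalty
```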

This objective is optimized with proximal policy optimization (PPO). Policy optimization because the output probabilities of the model are called the model's policy in reinforcement learning, and proximal because each update is kept close to the previous policy. The penalty in term #2 works in the same spirit, keeping the model close to the base model.

Only one of these terms (#3) is directly training the model to predict the next token. Arguably, term #2 is also a proxy for next-token prediction, since we stay close to a base model that was trained on next-token prediction.

But term #1, the RLHF term, is fundamentally different from next-token prediction. This means that LLMs do not just predict the next token because they are literally not trained to just predict the next token.

LLMs as Chess Players

Imagine a chess-playing model like AlphaZero. If we ignore the details of tree search, the model takes in a chessboard and outputs a distribution over possible moves (its policy). The model is trained based on the results of games it plays, and over time, it gets better at playing games.[8]

This is the essence of reinforcement learning: we have an agent (the chess-playing model) that interprets the environment (the chessboard) and takes actions (moves on the chessboard) that in turn affect the environment. It tries to choose actions that maximize its perceived reward based on the environment. Its actions are expressed as a policy, which is a probability distribution over possible next moves.

Reinforcement learning diagram

Regardless of whether RLHF is truly RL, there is an important analogy we can draw here. Rather than next-token prediction machines, LLMs are agents that interpret their environment (the prompt and output so far) and take actions (the next token) that affect the environment. They try to choose actions that maximize their perceived reward (the reward model, which helps them produce outputs that humans like).

RL chess bots predict next moves as possible actions, just like LLMs predict next tokens as possible actions.
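The analogy can be written out as a loop. In this sketch, a hand-written `toy_policy` stands in for the LLM (it is not a real model), the state is the prompt plus the output so far, and each sampled token is an action that changes the state.

```python
import random


def toy_policy(state):
    """Hypothetical policy: a distribution over next tokens given the text so far."""
    if state and state[-1] == "hello":
        return {"world": 0.9, "<eos>": 0.1}
    if state and state[-1] == "world":
        return {"<eos>": 1.0}
    return {"hello": 1.0}


def generate(prompt, policy, rng, max_tokens=10):
    state = list(prompt)  # environment: the prompt and the output so far
    for _ in range(max_tokens):
        distribution = policy(state)  # policy: probabilities over possible actions
        tokens, weights = zip(*distribution.items())
        action = rng.choices(tokens, weights=weights)[0]  # action: the next token
        if action == "<eos>":
            break
        state.append(action)  # the action changes the environment
    return state[len(prompt):]


rng = random.Random(0)
output = generate(["say", "hi"], toy_policy, rng)
print(output)
```

Swapping the token vocabulary for chess moves and the text state for a board gives the same loop a chess agent would run.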

This is a consequence of how we trained the LLM in RLHF. The part that doesn't change, no matter how we train the LLM, is its input space (strings of tokens) and output space (distributions over tokens). We can interpret this in so many different ways:

During pretraining, an LLM becomes an agent that tries to take actions (next tokens) in order to predict the next token. At this point, it's just a statistical model.
During RLHF, an LLM becomes an agent that tries to take actions (next tokens) to ultimately produce entire outputs in a way that (indirectly, via a reward model) appeals to human judges.
During chain-of-thought RL training like in DeepSeek R1,[9] an LLM becomes an agent that tries to take actions (next tokens during both reasoning and output) to ultimately produce outputs that are more likely to be correct.

From here, we can imagine LLMs that write code and evaluate it by running it, or even solve math problems and evaluate their solutions with proof assistants.[10] On the more sinister side, we can imagine LLMs that are rewarded for spreading misinformation.
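As a sketch of what "evaluate it by running it" might look like, the function below scores a candidate program by the fraction of test cases it passes. The names `code_reward` and `solution` are made up for this example, and calling `exec` on untrusted model output is unsafe outside a sandbox, so treat this as an illustration only.

```python
def code_reward(source, test_cases, func_name="solution"):
    """Fraction of test cases the candidate passes; 0.0 if it fails to load."""
    namespace = {}
    try:
        exec(source, namespace)  # unsafe outside a sandbox; acceptable for a toy example
        func = namespace[func_name]
    except Exception:
        return 0.0
    passed = 0
    for args, expected in test_cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash on this test case counts as a failure
    return passed / len(test_cases)


tests = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]
good = "def solution(a, b):\n    return a + b\n"
bad = "def solution(a, b):\n    return a * b\n"
broken = "def solution(a, b) return"

print(code_reward(good, tests), code_reward(bad, tests), code_reward(broken, tests))
```

Unlike a reward model trained on human rankings, this reward does not depend on whether the code looks convincing, only on whether it produces the expected results.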

In short, to the extent that words matter, LLMs can be thought of as agents that take actions rather than just statistical models, even if those “actions” are words.

Why AI Agents?

If LLMs are already agents, what's the buzz around “AI agents”?

LLMs alone are RL “agents” whose actions are limited to producing tokens. By mapping tokens to real-world actions, we can make the agent-like behavior already baked into LLMs into something even more tangible.
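One common way to map tokens to actions is to have the model emit a structured request that ordinary code then executes. The sketch below assumes a made-up JSON format and a pretend tool called `get_time`. Real agent frameworks each define their own formats.

```python
import json


def get_time(city):
    """Hypothetical tool; a real one would call an external service."""
    return f"(pretend current time in {city})"


TOOLS = {"get_time": get_time}


def act_on_output(model_output):
    """Run a tool if the output is a JSON tool call; otherwise show it as text."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return model_output
    if not isinstance(call, dict) or call.get("tool") not in TOOLS:
        return model_output
    return TOOLS[call["tool"]](**call.get("arguments", {}))


print(act_on_output("Hello there!"))
print(act_on_output('{"tool": "get_time", "arguments": {"city": "Tokyo"}}'))
```

The model itself still only produces tokens. The surrounding code decides which tokens count as actions in the world.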

Remember that the reward function in RLHF is a proxy for how much humans like the output of an LLM. So an LLM is already trained to generally follow instructions in a way that is appealing to humans. This can easily be extended to all sorts of actions that an LLM can take.

As is, the model is mainly trained to appeal to human judges. Code is a clear example where we could do better: an agent could instead be trained with RL on whether its code actually works, rather than on the assumption that whatever code the evaluators tend to like is correct. Prompt engineering can help with this, but there's a lot of room to improve through actual training, especially in clear-cut domains with easy feedback loops like writing code for well-defined problems.

Additionally, the proxy goal of pleasing human evaluators is great, but it's not perfect. LLMs can fool humans, creating outputs that seem good when they are actually flawed. In fact, this has already been demonstrated: when training on question answering and code generation, one team found that RLHF made the model produce outputs that evaluators liked better over time, even though it actually got worse at the task itself.[11]

It's not easy to find a better technique, though: reinforcement learning of all kinds is very prone to reward hacking, where models learn to exploit the reward function without really improving at the task.

Although RLHF isn't perfect, it is extremely powerful. It's surprising that a “mindless next-token prediction machine” can appear to show intelligence, but if we reframe an LLM as a machine that aims to appeal to humans through producing tokens, this starts to make a lot more sense.

It's still true that next-token prediction is an important part of LLMs, not only in pretraining but even as a component of RLHF. But I hope I was able to explain why there are much deeper layers to how an LLM works, first with RLHF, and second with other kinds of RL like chain-of-thought reasoning.

Fundamentally, an LLM is not a next-token predictor. It's actually something even more basic: a machine that outputs tokens. We can choose whether we train that machine to predict the next token, appeal to human evaluators, write code, or do something else entirely. And we can choose whether we simply display those tokens to a user or use them to call functions and create effects in the real world. It's up to us to make the best choices.


References

[1] Reinforcement Learning from Human Feedback (Nathan Lambert, 2024)
[2] CrossEntropyLoss (PyTorch Contributors, 2024)
[3] Language models are better than humans at next-token prediction (Buck Shlegeris, Fabien Roger, Lawrence Chan, & Euan McLean, 2022)
[4] Finetuned Language Models Are Zero-Shot Learners (Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, & Quoc V. Le, 2022)
[5] Fine-tuning Llama3 with Chat Data (torchtune Contributors, 2023)
[6] My objection(s) to the "LLMs are just next-token predictors" take (Alejandro Tlaie Boria, 2025)
[7] Training language models to follow instructions with human feedback (Ouyang et al., OpenAI, 2022)
[8] Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm (David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, & Demis Hassabis, DeepMind, 2017)
[9] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek-AI, 2025)
[10] AI achieves silver-medal standard solving International Mathematical Olympiad problems (AlphaProof and AlphaGeometry teams, 2024)
[11] Language Models Learn to Mislead Humans via RLHF (Jiaxin Wen, Ruiqi Zhong, Akbir Khan, Ethan Perez, Jacob Steinhardt, Minlie Huang, Samuel R. Bowman, He He, & Shi Feng, 2024)

Why AI Can't Do Your Laundry

2025-01-16T16:38:07Z

Here are some things AI can do:

Write a sonnet or a short story
Create realistic or abstract artwork
Code a passion project
Teach students new concepts

One can argue AI does all or most of these things poorly, which is a fair argument. But ultimately, AI can do all this at some level, and is only improving over time.

Now, here are some things AI can't do:

Lay bricks
Clean bathrooms
Solve world hunger
Distribute mosquito nets
Do the dishes
Do laundry

There's a pattern here. It's crazy, and a little upsetting, that AI can write poetry but can't do your laundry. In general, AI is able to do things humans want to do, but not things humans don't want to do or can't do. A lot has been said on this topic with respect to robotics, and why automating manual labor in many ways is harder than building a language model that reasons or solves difficult math problems. But I want to tackle a slightly more general problem in this post: Why does AI do the things we want to do, and not the things we don't want to do or can't do? How can we fix this?

But hidden in this line of reasoning are three assumptions: first, that manual labor is more difficult to automate than creative tasks; second, that manual labor is desirable to automate while creative tasks are not; and finally, that manual labor and white-collar knowledge work represent the two main job types for AI to do. All of these assumptions are up for debate, and are less true than they seem.

Can AI Do Your Laundry?

When people say “AI” these days, they usually mean “deep learning with lots of data”. If you use that definition, then no, AI can't do your laundry.

But if you ask whether technology can do your laundry, the answer is yes. Laundry machines and dishwashers have been incredible for humanity: they are able to do these chores thoroughly while using less energy than manual cleaning.[1] Before the invention of appliances like these, people, especially women, would often stay at home and do these chores. In this case, automation actually massively increased workforce participation, allowing women more freedom to work instead of spending all their time on the household.[2]

Advertisement in an 1896 issue of McClure's Magazine for The Faultless Quaker Dishwasher. Image source: Wikimedia, Public Domain.

Now, the tasks of doing laundry and dishes mainly involve moving clothes or dishes to the appropriate places and pressing a few buttons. Deep learning (DL) could improve this if we had robots that could scrape food from plates and place them in the dishwasher, or fold and hang clothes from the dryer, but the current situation is already pretty awesome.

So maybe AI, or rather technology, can already do your laundry.

Laundry and dishwashing are not unusual here: just as the majority of human work throughout history has been physical labor, so has the majority of the work that has been automated. It's true that increased literacy rates made scribes obsolete, and calculators largely replaced human computers. But we also see that the Industrial Revolution displaced textile workers, modern manufacturing displaced blacksmiths, and we even had alarm clocks succeed human “knocker-uppers”.

Human computers in the NACA High Speed Flight Station "Computer Room", Dryden Flight Research Center Facilities, summer 1949. Image source: Wikimedia, Public Domain.

It's only in the most recent wave of AI that this trend seems to have been reversed. Deep learning techniques have a much easier time with human knowledge work than physical labor. They can play chess and other games at a superhuman level, and can even solve unsolved math problems. But they can't reliably handle unexpected events on the road[3] or even bring you a cup of tea[4].

This is known as Moravec's paradox: things that seem hard to us (like reasoning) are easier for computers, while sensorimotor and perception tasks we do effortlessly are very hard to replicate. You might think the difference between these tasks and the more “advanced” reasoning tasks is one of hardware, where building the right hardware to support the intelligence is a difficult problem. But the difficulty is actually (mostly) on the software side: humans can control robots to perform these tasks quite easily.[5]

One of the main issues is that there isn't enough high-quality data for robots to train on.[5] There are other issues too, but without solving this, a DL approach to robotics has no hope of matching what we've seen in other AI models. In contrast, large language models are able to train on the entire internet and then some, and can include both images and text in their training.

This is the crux of the problem, the fundamental reason why AI can't do physical labor tasks we want to automate and instead does creative work we want to do ourselves. The internet is full of everything from TikTok dances to classic works of philosophy, but I've never come across first-person video footage of doing the dishes complete with muscle movement data. The internet is primarily a place for people to share knowledge work and their passions. By training AI on the internet, we are largely training AI to mimic human creativity and ingenuity, while leaving out the drudgery of life.

There are two ways to solve this problem: either collect more data, or create it. To create data, we can use reinforcement learning (RL). We can train robots in simulation, or even in real life, to maximize some reward we set. But this is vulnerable to reward hacking: we've seen this in games like CoastRunners, a boat racing video game, where an AI got a high score by knocking over targets for points instead of completing the race.[6] In a more dangerous example, a robot trained to make you tea and nothing else might knock down doors, spill hot water, and break cups, as long as it achieves its objective of making you tea.

There is also the option of collecting data outside the internet. We could use humans performing the task as a starting point, rather than training a pure RL robot. But it's expensive to get all that human input, and it doesn't work for tasks that are too dangerous or difficult for humans.

And if we take off our AI hype glasses for a second, we might question the use of deep learning at all for this purpose. Robotics startups often don't use DL,[7] which makes sense because DL is a statistical technique with room for error, while robotics needs perfect reliability. Rather than pure RL/DL, we might see a growth in non-DL approaches or, more likely, a mix of both.

I'm looking forward to seeing some great advances in robotics in the coming years, and maybe even a “ChatGPT moment”. But these obstacles will likely continue to limit the progress we can make in robotics, while we'll see more advances in AI for creative work and knowledge work as a consequence of the kind of data we see on the internet.

Should AI Do Your Laundry?

The dominant narrative is that AI should automate manual labor rather than creative and stimulating tasks. For the most part, I agree. But this isn't as clear-cut as it seems.

The word Luddite, now used to describe anyone opposed to new technology, originates from a group of textile workers in 19th-century England who felt threatened by the new technologies of the Industrial Revolution. They had spent years refining their craft to produce high-quality garments, and suddenly people with minimal expertise could churn out textiles with new machines. Eventually, the Luddites became so upset that they started smashing textile machines across England, which the government made punishable by death.[8] Clearly, they were not happy to have their job automated, even though it was more physical work than knowledge work.

It's true that unlike dishes or laundry, manual textile work requires a lot of specialized training and artistry on top of the physical component. But this is true of many jobs involving physical labor: plumbers, mechanics, electricians, and welders all require a lot of training to refine their skills, while trades like carpentry are arguably even art forms. Even if this weren't the case, the important fact here is that people in the skilled trades tend to be satisfied with their careers: nearly 90%, in fact.[9] Automating away their livelihoods is no more desirable than automating that of a spreadsheet worker.

A tractor, an example of a machine that automates much of a traditional farmer's job without displacing the farmer. Image source: Unsplash.

Knowledge work versus blue-collar work is a useful distinction to see what AI is most likely to automate, but it is not as useful to determine what AI should automate. For that, we need to start with a simpler question: Do people want to do this job?

If we look at the careers with the lowest levels of happiness,[10] we see a mix of blue-collar and white-collar jobs. While we hear about fears of AI automating artists or programmers on social media, many actual candidates for automation with current AI are things we find on this list: data analyst, customer service representative, administrative assistant. Blue-collar roles, like cashier and retail salesperson, are not being automated per se but are simply shrinking: we have more self-checkout and online shopping now.

When we say we want AI to do our laundry, we don't necessarily mean that we want it to automate blue-collar jobs instead of white-collar ones. Instead, we want it to automate the boring parts of life that don't make us money or give us fulfillment. This is good for the economy and human livelihoods all around: we saw that the invention of new household appliances increased workforce participation, so surely AI for similar purposes will increase both economic productivity and human flourishing. What if we could have AI for cooking, cleaning, and taking out the trash?

Besides Laundry, What Should AI Do?

What AI should do and what it will do aren't always the same. It seems to me that unfortunately, AI will automate away the jobs of some people who wanted to keep those jobs. It might not happen as explicitly as in the case of the Luddites, but the shift in economic incentives won't be 100% good for everyone. There will be harm, as with any new automation.

When faced with the possibility of AI harming us, we can and should ask how we can make AI serve us as regular people. An obvious candidate is chores like cooking, cleaning, and taking out the trash. Having robots that can do these non-career chores would free up people to work more and enjoy more, just like laundry machines did in the past.

But there's a lot more to AI than chores. So far, I've mostly been talking about using robotics to automate things that humans don't want to do, like chores and jobs they dislike. But equally important is AI automating things humans can't do, at least not feasibly. Unlike something like AI autocompleting a line of code or designing a logo, this would be a real step towards using AI to build a world of great prosperity far beyond what humans can achieve on their own.

The internet is full of things humans have done, so it seems like a bad place to train an AI to do things humans can't feasibly do themselves. But the key word in that sentence is “feasibly”. Customer service is a great example: humans can provide customer service over the phone, but it's not feasible to provide it at all hours, for all companies, for all customers who might have a complaint. With large language models, this can be automated at scale, ultimately using human-created data from the internet to accomplish something humans couldn't feasibly organize to do. Companies like Bland AI are already working on this.

Another example is upsampling: loosely speaking, taking an existing image, video, or audio and filling in the blanks in some way. In 2023, the surviving Beatles used AI to restore John Lennon's vocals from old recordings and release the last ever Beatles song, “Now and Then”.[11] We can imagine the general principle of AI upsampling for things like remastering old songs, animating and colorizing historical photos, and cleaning all kinds of noisy data for business and engineering purposes.

Then there are science experiments. Science labs are expensive, and running experiments is still costly even if you have one. AI can provide at least reasonably accurate experimental results without needing to run a real experiment, allowing scientists to refine their theories more quickly while running experiments more selectively. Technologies like AlphaFold demonstrate this, and promise to speed up research that could save lives.

An illustration of protein structure, which AlphaFold aims to predict. Image source: Holger87, Wikimedia, CC BY-SA 3.0.

Finally, one of the main purposes of machine learning in general is prediction: given some past data, we use statistical methods to predict what comes next. New machine learning methods are being used for forecasting in finance, weather, and even early natural disaster warnings. No human can look at a giant spreadsheet and come up with these predictions, not even with Excel formulas, so this is another case of using AI to expand upon rather than replace human potential.

Unlike customer service or upsampling, we need specialized datasets for many of these applications like automating science experiments and forecasting: we can't just dump the internet onto a model. But so far, we've had encouraging successes in these fields.

Too often, the debate over AI automation focuses only on using AI to replace human activities, some we like and others we don't. It's easy to forget that AI can also help us by accomplishing things that humans just can't do, just like earlier computer programs from calculators to chess bots.

AI for Human Potential

AI doesn't have to be all about replacing jobs people love and leaving them with nothing. Ideally, we can use AI for two main reasons: first, automating jobs people hate, and second, doing things people can't do without AI.

Even when automating jobs people hate, like customer service, there is the risk of displacing livelihoods. But if companies make more money, they will want to use the money to sell more; and in order to sell more, they will hopefully hire people for new jobs. These new jobs should have better working conditions and more fulfillment than the ones that are being automated. We've often gotten this from automation in the past: with 3D printing, a CAD engineer has better working conditions and probably a more fulfilling job than a factory worker. If we're careful, we can replicate this kind of good automation with AI as well.

Automation is the most obvious possible result of AI: we have a new thing with some kind of intelligence, and it's natural to think about how it might augment or replace our own intelligence in roles we already perform as humans. But it might be better to instead think about the more hidden ways that AI can do things we can't do, like upsampling, forecasting, and predicting the results of science experiments.

We can and should build AI that increases human potential instead of replacing it. We should, but it's not yet certain that we will: if we (and policymakers) play our cards wrong, AI starting to automate human labor for corporations without creating new jobs to compensate is a real possibility. How it goes is still up to us.


References

Handwashing vs Dishwasher (Reckitt Benckiser)
How the appliance boom moved more women into the workforce (Jeremy Greenwood, “Evolving Households: The Imprint of Technology on Life”, 2019)
The main reason why self-driving cars are not ready for prime time (Khristopher J. Brooks, CBS News, 2024)
The Tea test of robot intelligence (Alan Winfield, 2019)
Common misconceptions about the complexity in robotics vs AI (Dan Ogawa Lillrank, 2024)
Faulty reward functions in the wild (Clark & Amodei, OpenAI, 2017)
Why Robotics Startups Don't Use Deep Learning (Think Autonomous, 2022)
Who Were the Luddites? (Evan Andrews, History, 2023)
New Angi Report Finds Nearly 90% of Skilled Tradespeople Satisfied in Their Careers (Angi, Inc., 2024)
10 Careers Reporting the Lowest Levels of Happiness (Andrea Moran, The Washington Post, 2024)
Listen to ‘Now and Then’ by The Beatles, a ‘new’ song recorded using AI (Andrew Paul, Popular Science, 2023)

Predicting Every Election Since 1916

2024-11-08T16:38:07Z

In just 91 lines of C++ code, I perfectly predicted every United States presidential election since 1916. That's 28 straight elections, counting the most recent one in 2024.

The crazy part is I didn't rely on any complicated polling data trends, voter sentiment, or policy analysis to make these predictions. I just used basic principles of probability.

The US presidential election results in 1916. Public domain. By AndyHogan14, Wikimedia.

Alright, I'll admit I cheated a little. But arguably not much more than the political pundits that claim to have predicted every election since, say, 1980.

Every election cycle, you see stories on the news of someone who has correctly predicted every election in however many years. Most recently, I saw stories about Allan Lichtman, who correctly predicted most of the 10 elections from 1984 through 2020. His system for predicting elections is called the “13 Keys”, and consists of 13 true/false questions to predict the winner of the election.[1]

But then Allan Lichtman got the 2024 election wrong. Does this cast doubt upon election pundits who claim to have sophisticated election prediction systems?

In this article, I'm going to show you how you, too, can predict every single election in over 100 years. You can do this with a very simple deterministic system that requires even less information than the 13 Keys, and yet is more accurate, as long as you're willing to be fooled by statistics!

I'll also explain why, mathematically, the seemingly insightful achievement of predicting election results actually means very little.

How is This Possible?

How is it possible to predict every single election since 1916? Surely it couldn't happen by random chance. After all, there have been 28 elections since 1916, inclusive. Each one has had at least 2 major candidates, and a few of them actually had 3. So the probability of guessing all 28 elections correctly purely by chance is less than $1 / 2^{28}$, which is about 1 in 300 million.

But wait: 300 million? That's a familiar number: the population of the United States is a little over 300 million. So if everyone in the United States guessed the election results at random for every election since 1916, we would expect about one of them to guess every single outcome correctly. This person would be praised by the country as a masterful political pundit, and everyone would eagerly await their prediction for the next election… even though it would have only a 1/2 chance of being correct!
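As a quick sanity check of these numbers, here is a minimal Python sketch. It assumes exactly two candidates per election, fair coin flips, and a round population of 300 million; the variable names are just illustrative.

```python
# Chance of guessing 28 two-way elections correctly with fair coin flips,
# and how many perfect guessers we would expect in a large population.
N_ELECTIONS = 28
POPULATION = 300_000_000  # round figure, as used in the text

p_perfect = 0.5 ** N_ELECTIONS
expected_perfect_guessers = POPULATION * p_perfect

print(f"chance of a perfect record: 1 in {round(1 / p_perfect):,}")
print(f"expected perfect guessers: {expected_perfect_guessers:.2f}")
```

The exact figure is 1 in 268,435,456, so 300 million random guessers give slightly more than one expected perfect record.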

Of course, very few, if any, Americans today have been alive to predict elections since 1916. And few Americans make public election predictions for the world to judge. So let's try an argument with slightly more realistic numbers.

Let's say there are 2000 Americans who are potentially in the business of predicting elections, and who are of age to have seen all elections from 1980 through 2024 (that's 12 elections). Each one has some kind of a system based on polling data, economic trends, and other factors, giving them a 60% chance of being correct in any given election. Then the chance that any given predictor gets all 12 elections correct is $(0.6)^{12}$, or about 0.2%. The chance that at least one predictor of the 2000 gets all 12 elections right is $1 - (1 - 0.6^{12})^{2000}$, or 98.7%!
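The arithmetic can be reproduced in a few lines of Python. This is only a sketch under the same assumptions as the text (independent predictors, each with a fixed 60% accuracy); the constant names are made up for illustration.

```python
# Probability that at least one of many independent predictors
# gets every election right.
ACCURACY = 0.6        # assumed per-election accuracy of each predictor
N_ELECTIONS = 12
N_PREDICTORS = 2000

p_single = ACCURACY ** N_ELECTIONS                     # one predictor, perfect record
p_at_least_one = 1 - (1 - p_single) ** N_PREDICTORS    # at least one of them

print(f"one predictor: {p_single:.2%}")
print(f"at least one of {N_PREDICTORS}: {p_at_least_one:.1%}")
```

Changing ACCURACY or N_PREDICTORS shows how quickly a perfect record somewhere in the crowd becomes close to inevitable.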

If we allow more than 2000 predictors, or more than 60% accuracy, this probability gets even higher.

This assumes that all predictors are independent, which certainly isn't the case: all of them use much of the same underlying data. But even without the independence of predictors, 98.7% odds with just 2000 predictors is a high number. This indicates that it's quite possible for someone to be right on almost all elections, despite not having a very accurate underlying model.

How Likely is This?

Let's look deeper into this model of everyone in America guessing randomly.

In just one election, you have a 1/2 chance of being right. As you increase the number of elections, your chance of being right on all of them drops off exponentially. But your chance of being right on many or even most of them remains fairly high for quite a while after.

In general, the probability of getting $k$ out of $n$ elections right by guessing randomly is given by the binomial distribution: $P(k) = \binom{n}{k} \left(\frac{1}{2}\right)^{n}$. The $\binom{n}{k}$ factor keeps our probability high for medium values of $k$.

From the graph, once we reach 12 elections (1980–2024), we still have about a 1.6% chance of getting exactly 2 elections wrong from guessing randomly. So this outcome is very much possible, especially when lots of people try to guess the elections, and when they do just a little better than guessing randomly. But eventually, with a large number of elections, you are almost guaranteed to get more than 5 wrong.
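For readers who want to compute these probabilities directly, here is a small sketch of the binomial formula in Python. The function name and the optional p parameter (a per-election accuracy other than 1/2) are my own additions for illustration; with the default p = 0.5 it matches the coin-flip model.

```python
from math import comb

def prob_exactly_k_correct(k, n, p=0.5):
    """Binomial probability of exactly k correct guesses out of n elections.

    p is the chance of guessing a single election correctly;
    p = 0.5 corresponds to flipping a fair coin.
    """
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# 12 elections: chance of every possible number of mistakes by pure guessing
n = 12
for wrong in range(4):
    prob = prob_exactly_k_correct(n - wrong, n)
    print(f"{wrong} wrong out of {n}: {prob:.2%}")
```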

We can expand this random guessing model to 300 million Americans, using $P_{300M}(k) = 1 - (1 - P(k))^{300{,}000{,}000}$. That is, $P_{300M}(k)$ is the chance that at least one person out of 300 million guesses exactly $k$ elections correctly.

All the way up to about 30 elections, there's a decent chance that someone will guess every single one correctly, just randomly! And we have decent numbers all the way into the 50s, where we might get just 5 elections wrong. Of course, past 5 elections, it's almost certain that someone of the 300 million gets more than 5 elections wrong.
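Here is a sketch of the 300 million guesser model in Python. Because the per-person probability is tiny and the exponent is huge, this version uses log1p and expm1 from the standard library instead of evaluating the power directly; that is a numerical choice of mine, not something from the original analysis. The function name is hypothetical.

```python
from math import comb, expm1, log1p

def prob_someone_gets_k(k, n, population=300_000_000):
    """Chance that at least one of `population` independent coin-flippers
    gets exactly k of n elections right."""
    p = comb(n, k) * 0.5**n          # binomial probability for one person
    if p >= 1.0:                     # only possible when n == 0
        return 1.0
    # 1 - (1 - p)**population, computed stably for very small p
    return -expm1(population * log1p(-p))

for n in (12, 28, 30, 40):
    print(f"{n} elections, all correct: {prob_someone_gets_k(n, n):.1%}")
```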

Predicting Every Election Since 1916

Now it's time to predict every single election since 1916. The algorithm is very simple:

For each election, determine who the top contenders are. Generally there are only two top contenders in a US election; a few times, there have been three, four, or even one. I assume it's easy to figure this out.
Sort the contenders in alphabetical order by last name.
Flip a coin. Guess heads for the first contender, tails for the second. (Or maybe use a three-sided die for 3 contenders, etc.)

And that's basically it.

But there's one key thing about the coin. It can't be a physical coin. You have to use a pseudorandom number generator in a computer.

In fact, you have to use C/C++ random number generation. Seed it with the random seed 824050438, and then start picking random values. (Use modulus on each random value to pick the actual candidate.) If you go and check this algorithm with this seed, you'll be amazed to find that you can predict every single election from 1916 to 2024 correctly!

But wait, isn't that cheating?

Yes, choosing a random seed that I know works perfectly is cheating. But hardly more so than having multiple people predicting the election, and declaring a political pundit only when at least one gets most of the elections right, just as I declare an optimal seed when at least one gets all the elections right. It's just a matter of cheating at the individual level versus the societal level.

Let's make a toy model in Python. You can find the full code, as well as the more efficient C++ version, on GitHub.

First we set up and preprocess our dataset. In this case, it's the list of all main contenders in US elections, and who the winner was in each case.

elections = [ # list the winner first
  [1789, ["Washington"]],
  [1792, ["Washington"]],
  [1796, ["Adams", "Jefferson"]],
  [1800, ["Jefferson", "Adams"]],
  [1804, ["Jefferson", "Cotesworth"]],
  ...
  [1856, ["Buchanan", "Frémont", "Fillmore"]],
  [1860, ["Lincoln", "Breckinridge", "Bell", "Douglas"]],
  [1864, ["Lincoln", "McClellan"]],
  [1868, ["Grant", "Seymour"]],
  ...
  [1996, ["Clinton", "Dole"]],
  [2000, ["Bush", "Gore"]],
  [2004, ["Bush", "Kerry"]],
  [2008, ["Obama", "McCain"]],
  [2012, ["Obama", "Romney"]],
  [2016, ["Trump", "Clinton"]],
  [2020, ["Biden", "Trump"]],
  [2024, ["Trump", "Harris"]]
]

for e in elections: # preprocessing
  sorted_names = sorted(e[1]) # sort alphabetically
  result = sorted_names.index(e[1][0]) # index of the winner, in alphabetical order
  e.append(len(sorted_names))
  e.append(result)

Now let's simulate randomly guessing elections 1 million times.

import random

TRIALS = 1_000_000 # 1 million seeds to try

def simulate_elections(seed):
    # guess every election at random, using one seed for the whole sequence
    random.seed(seed)
    correct = 0
    for _year, _names, n_contenders, winner_index in elections:
        # pick a random position in the alphabetical list of contenders
        guess = random.randint(0, n_contenders - 1)
        if guess == winner_index:
            correct += 1
    return correct

max_correct = 0
best_seed = -1

for seed in range(TRIALS):
    correct = simulate_elections(seed)
    if correct >= max_correct: # on ties, keep the latest seed
        max_correct = correct
        best_seed = seed

print(f"best seed {best_seed}: {max_correct}/{len(elections)}")

This code runs in 20 seconds. The best seed comes out to 824728, with 48/60 elections correct. But can we do better? Can we get every single election correct?

We'll start by limiting ourselves to the last 28 elections (1916–2024). The code now runs in 13 seconds and gets 26/28 elections correct with the seed 787252. Getting better!

In order to improve from here, we need an improvement in processing power. My C++ code, which I won't include here, runs on essentially the same principle but adds multithreading. This allows me to run 3000 simulations on our dataset in parallel, speeding up this process tremendously.

In C++, I manage to get 28/28 elections correct using the seed 824050438, which takes 20 seconds to find.

Remember 20 seconds is just the time to discover this seed. Once we have the seed, we can technically compute election results almost instantly without knowing the results in advance! All we need is the list of top contenders in each election. We stuff in our seed and all the results will fall out perfectly.
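To illustrate the difference between finding a seed and replaying it, here is a self-contained sketch in Python. It uses only the eight most recent elections from the dataset above so the search finishes quickly, and the names predict and find_seed are hypothetical helpers. The seed it finds belongs to Python's generator and this small dataset; it is not the C++ seed 824050438.

```python
import random

# (year, [winner first, then the other contenders]), as in the dataset above
ELECTIONS = [
    (1996, ["Clinton", "Dole"]), (2000, ["Bush", "Gore"]),
    (2004, ["Bush", "Kerry"]), (2008, ["Obama", "McCain"]),
    (2012, ["Obama", "Romney"]), (2016, ["Trump", "Clinton"]),
    (2020, ["Biden", "Trump"]), (2024, ["Trump", "Harris"]),
]

def predict(seed, contender_lists):
    """Replay a seed: one random pick per election from the sorted contenders."""
    rng = random.Random(seed)
    picks = []
    for names in contender_lists:
        ordered = sorted(names)  # alphabetical order hides who won
        picks.append(ordered[rng.randrange(len(ordered))])
    return picks

def find_seed(elections, max_seed=100_000):
    """The 'cheating' step: search for a seed that matches the known winners."""
    winners = [names[0] for _, names in elections]
    contender_lists = [names for _, names in elections]
    for seed in range(max_seed):
        if predict(seed, contender_lists) == winners:
            return seed
    return None

seed = find_seed(ELECTIONS)
# Replaying needs only the (alphabetized) contenders, not the results.
ballots = [sorted(names) for _, names in ELECTIONS]
print(seed, predict(seed, ballots))
```

Only find_seed ever looks at the winners. Once the seed is known, predict reproduces the full record from the list of contenders alone, which is what makes the result look like foresight.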

So there you have it: a deterministic algorithm to perfectly predict every US presidential election since 1916!

This kind of accuracy is a crystal ball, the likes of which has not been seen in any election predictor in American history. Given this immense level of insight, you might be wondering who will win the 2028 US presidential election. Assuming a race between a Democrat and a Republican in 2028, the magic random seed 824050438 predicts… the candidate whose last name comes first in alphabetical order. You heard it here first. Don't be surprised if I'm right!

Takeaways for a Scientist

What's the takeaway of this experiment in a scientific context, especially data science?

At first, my takeaway was not to extrapolate past model performance to future performance. After all, hindsight is 20/20. See this relevant XKCD: “Electoral Precedent”.

XKCD 1122: Electoral Precedent. By Randall Munroe. CC BY-NC 2.5

But I don't think that's exactly what we should take away from this. If a model does well on 2000 cat versus dog predictions, I think it's a safe bet that it'll also do quite well on the next 50, even if the future data has some important differences.

Instead, I think the more relevant insight here pertains to extrapolating model performance from small datasets. When a model has done well on a small dataset, we don't have enough evidence to predict its future performance. The US presidential election dataset is quite small: there have only been 60 elections as of 2024. Most well-known election predictors only try their hand at around 10, and even then imperfectly!
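One way to put numbers on this is a confidence interval for a model's true accuracy. The sketch below uses the Wilson score interval, which is a standard statistical tool that I am bringing in for illustration rather than something used in the analysis above. The two track records (9 of 10 and 1800 of 2000) are made-up examples with the same 90% hit rate.

```python
from math import sqrt

def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for the true success rate."""
    p_hat = correct / total
    denom = 1 + z**2 / total
    center = (p_hat + z**2 / (2 * total)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width

for correct, total in [(9, 10), (1800, 2000)]:
    low, high = wilson_interval(correct, total)
    print(f"{correct}/{total} correct: true accuracy plausibly {low:.0%} to {high:.0%}")
```

With 10 predictions, a 90% hit rate is compatible with a true accuracy not far above a coin flip. With 2000 predictions, the same hit rate pins the accuracy down to within a couple of percentage points.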

Another takeaway is to always use a baseline before trusting your metrics. If you don't have at least a random chance baseline for your predictions, if not a more sophisticated model, good performance isn't always an indication that you're doing something right. This is a common mistake in machine learning, where people tend to build deep learning models for simple datasets that work quite well but, ironically, still do worse than linear regression.

And how about the takeaway in a political context? I'm not saying that these political analysis models are completely baseless, like a random number prediction based on the candidates' last names. I'm sure they have better than 50% odds because they genuinely take important information into account.

But I am saying that we should be skeptical when we hear claims of any one person or method being able to consistently predict election results—especially if they get a few wrong, because the probability of getting most but not all correct by pure chance is significant. We should evaluate the methodology further before assuming its accuracy.

So my overall takeaway is that as a scientist, you should avoid extrapolating performance from small datasets, and always use a baseline before trusting your metrics. And as a citizen, don't believe everything the election pundits tell you: for all you know, they could be flipping coins off camera!


References

The GitHub for this article, including figures, is at crackalamoo/blog-demos.

Allan Lichtman (Wikipedia)

Can Transformers Solve Everything?

2024-10-30T16:38:07Z

Transformers are best known for their applications in natural language processing. They were originally designed for translating between languages,[1] and are now most famous for their use in large language models like ChatGPT (GPT stands for “generative pretrained transformer”).

But since their introduction, transformers have been applied to ever more tasks, with great results. These include image recognition,[2] reinforcement learning,[3] and even weather prediction.[4]

Even the seemingly specific task of language generation with transformers has a number of surprises. Large language models have emergent properties that feel more intelligent than just predicting the next word. For example, they may know various facts about the world, or replicate nuances of a person's style of speech.

The success of transformers has made some people ask the question of whether transformers can do everything. If transformers generalize to so many tasks, is there any reason not to use a transformer?

Clearly, there is still a case for other machine learning models and, as is often forgotten these days, non-machine learning models and human intellect. But transformers do have a number of unique properties, and have shown incredible results so far. There is also a considerable mathematical and empirical basis for why we should expect this success to continue.

The real question, then, isn't “can transformers solve everything?” Instead, it's “why shouldn't they solve everything?” There are a few reasons why not:

There are computational constraints: we might not have strong enough computers yet, or ever, to compute some things.
Also, especially in a machine learning context, there are data constraints: we might lack the dataset to feasibly train a model that is able to compute what we want.
Finally, there are algorithmic constraints: transformers might not be the right architecture or algorithm for what we want to do. We might even need to invent the right algorithm.

In this article we'll look at all these constraints, including a cool demo comparing transformers to the classical fast Fourier transform algorithm for time series.

Are transformers the one architecture to rule them all? A depiction of J. R. R. Tolkien's One Ring from Peter Jackson's films. Image source: Peter J. Yost / Wikimedia, CC BY-SA 4.0.

This Isn't the First Time

On Kaggle, XGBoost tends to win competitions on structured data while various kinds of neural networks dominate unstructured data competitions.[5] For quite a while now, people have been asking if model X is the model to end all models, a model that can solve everything.

Surprisingly, these wild claims are actually backed up by solid math. Most striking is the universal approximation theorem, which states that neural networks can approximate any continuous function on a bounded domain to any degree of accuracy: you just need enough neurons and nonlinear activation functions. In fact, you can even do this with a single hidden layer, given enough neurons.[6] This idea goes back at least to 1989.[7]
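To get a feel for why this works, here is a small sketch in plain Python. Instead of training anything, it constructs the weights of a one-hidden-layer ReLU network by hand so that the network matches a target function at evenly spaced points. This is an illustration of the idea for one-dimensional functions, not the proof of the theorem, and the function names are my own.

```python
import math

def relu(x):
    return max(0.0, x)

def build_relu_net(f, a, b, n_neurons):
    """One hidden layer of ReLU neurons that interpolates f on [a, b]."""
    h = (b - a) / n_neurons
    knots = [a + i * h for i in range(n_neurons + 1)]
    slopes = [(f(knots[i + 1]) - f(knots[i])) / h for i in range(n_neurons)]
    # Output weight of neuron i = change in slope at knot i
    weights = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, n_neurons)]
    bias = f(a)

    def net(x):
        # Neuron i computes relu(x - knots[i]); the output layer is a weighted sum.
        return bias + sum(w * relu(x - k) for w, k in zip(weights, knots))

    return net

def max_error(net, f, a, b, samples=1000):
    points = [a + (b - a) * i / samples for i in range(samples + 1)]
    return max(abs(net(x) - f(x)) for x in points)

for n in (5, 50):
    net = build_relu_net(math.sin, 0.0, math.pi, n)
    print(f"{n} neurons: max error {max_error(net, math.sin, 0.0, math.pi):.5f}")
```

Adding neurons shrinks the error, which is the theorem's promise in miniature. What the theorem does not tell you is how many neurons you need or how to find the weights by training.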

If we go back even further, we find the Church-Turing thesis from the 1930s. Informally, this is the idea that anything that can be computed can also be computed by a Turing machine.[8]
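If you haven't seen a Turing machine before, here is a minimal sketch of one in Python: a tape, a head, and a table of rules. The example rule table adds 1 to a binary number. The state names, rule format, and function name are illustrative choices of mine, not a standard API.

```python
BLANK = "_"

# (state, symbol read) -> (symbol to write, head movement, next state)
INCREMENT_RULES = {
    ("carry", "1"): ("0", -1, "carry"),   # 1 + carry = 0, keep carrying left
    ("carry", "0"): ("1", 0, "halt"),     # 0 + carry = 1, done
    ("carry", BLANK): ("1", 0, "halt"),   # ran off the left edge: new digit
}

def run_turing_machine(rules, tape_str, head, state, max_steps=10_000):
    """Run the machine until it reaches the 'halt' state; return the tape."""
    tape = dict(enumerate(tape_str))      # sparse tape: position -> symbol
    steps = 0
    while state != "halt":
        if steps >= max_steps:
            raise RuntimeError("machine did not halt within max_steps")
        symbol = tape.get(head, BLANK)
        write, move, state = rules[(state, symbol)]
        tape[head] = write
        head += move
        steps += 1
    cells = range(min(tape), max(tape) + 1)
    return "".join(tape.get(i, BLANK) for i in cells).strip(BLANK)

def increment_binary(bits):
    # Start at the rightmost bit, in the "carry" state
    return run_turing_machine(INCREMENT_RULES, bits, len(bits) - 1, "carry")

print(increment_binary("1011"))
```

The interesting part of the thesis is that this same loop, given a big enough rule table and tape, is believed to be enough for anything we would call computation.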

This holds up for transformers too. It turns out that subject to some constraints, transformers are theoretically capable of approximating any sequence-to-sequence function.[9] This means that with a Turing machine in Python and a neural network in PyTorch you can theoretically do any possible computation.

If that's the case, why haven't we computed everything yet?

This goes back to the constraints: computational constraints (not enough computational power), data constraints (not enough high quality data), and algorithmic constraints (using the wrong algorithm for a problem).

Scaling

Data, compute, and energy constraints on transformers, while not the same, are intimately tied together. Even if we have enough data to train a transformer model on a task, we might not have enough compute or energy to complete the training. All of these fall under the problem of scaling.

As discussed before, theoretically transformers can do many, many things. But getting transformers to do these things in practice generally requires scale. Often that's too expensive, and it's a better idea to use simpler models instead. Let's look at the future of scaling transformers to see exactly when transformers are a better option.

Large Companies

ChatGPT isn't a crazy algorithmic jump over GPT-2, which produced results like this. The difference is largely a difference of scale: more layers, more parameters, and more training.

Therefore, some argue that with enough scaling of transformers, we will reach a general transformer model that can do just about any thinking a human can: this would be artificial general intelligence (AGI). This could greatly reduce the need for other algorithms and, concerningly, possibly humans as well. We just need more data, more compute, and more time; or so the argument goes.

As for data, we probably still have a while to go. It's estimated that large language models (LLMs) have only trained on 1/30 of all data on the web, which is massive, but leaves a lot of room to expand. Additionally, it turns out that training LLMs on data that they themselves generated (synthetic data) can improve performance. So by letting LLMs improve themselves, we might have even more data than these figures would suggest.[10]

Compute itself also likely won't be a limiting factor for the largest companies. GPUs currently take up only a small fraction of all wafer production at TSMC (Taiwan Semiconductor Manufacturing Company), meaning we have lots of raw materials left over.[10] As demand increases, GPU production can be scaled up with it.

Instead, energy could be the bottleneck. To scale large transformer models 5000x, as is projected by 2030, the power demand just for a training run is projected to be 6 gigawatts. This is both extremely expensive and bad for the environment. Companies are looking into nuclear power to get around this, but there are various obstacles, especially regulatory concerns.[10]

Even when not training, running AI models will require significant power. Technologies such as Chain-of-Thought (CoT) reasoning in the new OpenAI o1 model could increase inference costs even further.

Smaller Companies

Smaller companies likely won't have the resources to train LLMs from scratch, and they also won't need them. If a small company needs to access a powerful transformer model from a large company, they can either pay for queries or host an open source model locally.

So far, these large transformer models have mostly been LLMs. But in the future, if transformers really can solve everything, we might see similar models for other kinds of transformers.

Some domain-specific transformers might also be less intensive to train, so startups can build and sell their own in-house. Already we see companies like Atmo using deep learning for weather forecasting. While transformers for language modeling are very intensive and slow, Atmo's model is actually faster and more accurate than the corresponding atmospheric physics simulations. If transformers can generalize to a wide range of domains outside language, we might see startups not only using large companies' LLMs, but also building their own niche and innovative transformers.

Domain-Specific Applications

Transformers can even do things we don't usually do with machine learning, like add and subtract numbers[11] and implement hashing algorithms.[12] If we use transformers to achieve some sort of AGI (whatever that means), then naturally we could do a wide range of tasks like these. Would this render other domain-specific models obsolete?

The transformer state of the art in arithmetic seems to be 99% accuracy on 100-digit numbers.[11] But there's an even better algorithm for adding and subtracting numbers. It takes minimal compute, requires no training data, can work with any size numbers, and has 100% accuracy. It's called... adding digits and carrying the extras.
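To make the comparison concrete, here is a minimal sketch of that schoolbook algorithm in Python. The function name `add_digits` and the choice to represent numbers as digit strings are my own assumptions for illustration; the point is only that a few lines with no training data handle numbers of any length exactly.

```python
def add_digits(a: str, b: str) -> str:
    """Add two non-negative decimal integers given as digit strings."""
    result = []
    carry = 0
    i, j = len(a) - 1, len(b) - 1
    # Walk both numbers from the least significant digit to the most.
    while i >= 0 or j >= 0 or carry:
        da = int(a[i]) if i >= 0 else 0
        db = int(b[j]) if j >= 0 else 0
        # Add the two digits plus the carry, then split into digit and new carry.
        carry, digit = divmod(da + db + carry, 10)
        result.append(str(digit))
        i -= 1
        j -= 1
    return "".join(reversed(result)) or "0"


big = "9" * 100               # a 100-digit number
total = add_digits(big, "1")  # a 1 followed by 100 zeros
```

The loop touches each digit once, so the cost grows linearly with the number of digits.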

There are many domains where speed, interpretability, and 100% accuracy really matter. A transformer will always fall short here, even if it could technically produce a decent result. These include arithmetic, cryptography, and mathematical proof verification.

So it seems these areas are safe from the influence of transformers, right?

Not quite. It's true that transformer models are only the best tool for the job in a certain subset of cases. However, one of these cases may be the task of determining when and where to carry out more traditional calculations! For example, transformers are much better than other models at coding, so in theory they could simply write programs to solve tasks that transformers themselves are ill-suited for.

With the correct setup, they can do this in a collaborative feedback loop with more traditional tools. Consider Google DeepMind's AlphaProof. This system combines a pretrained language model with Lean, a proof assistant that can verify mathematical proofs. So rather than just stochastically spitting out a proof, the language model can make sure its proof is correct and adjust as necessary. AlphaProof won a silver medal on the International Mathematical Olympiad, one of the most difficult and prestigious mathematics competitions.
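The general shape of such a feedback loop can be sketched with a toy problem. To be clear, this is not how AlphaProof is implemented: the `propose` function below is just a random guesser standing in for a stochastic model, and `verify` is a trivial exact check standing in for a proof assistant like Lean. All names here are hypothetical.

```python
import random


def propose(n: int, rng: random.Random) -> int:
    """Stand-in for a stochastic model: guess a candidate factor of n."""
    return rng.randrange(2, n)


def verify(n: int, candidate: int) -> bool:
    """Exact checker: accepts only genuine nontrivial factors of n."""
    return 1 < candidate < n and n % candidate == 0


def solve(n: int, max_tries: int = 10_000, seed: int = 0):
    rng = random.Random(seed)
    for _ in range(max_tries):
        candidate = propose(n, rng)
        if verify(n, candidate):
            return candidate  # only verified answers leave the loop
    return None  # give up rather than return an unverified guess


factor = solve(91)     # 91 = 7 * 13
no_factor = solve(97)  # 97 is prime, so nothing can pass the checker
```

The proposer is allowed to be wrong as often as it likes, because correctness comes entirely from the checker.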

Algorithmic Constraints

What do we mean by algorithmic constraints? In general, this is the idea that a transformer trained on some data might not be the best algorithm we have to solve a given problem.

In fact, this might be a computational constraint in disguise: maybe a transformer can technically solve a problem, but the amount of data and compute required is far more than with a more specialized algorithm. Let's take a look at one such case.

Demonstration: Bad Algorithm means More Compute

We know transformers are expensive. But how much more expensive is a transformer, really? Let's test this out by simulating a noisy time series and using two methods to pick out the signal: the fast Fourier transform (FFT), a well-known tool for this job, and a transformer model.

We'll use 5000 total data points of the signal $\sin (x) + \frac{1}{5} \cos (\frac{11}{13} x) + \frac{1}{9} \sin (\frac{17}{37} x - \frac{π}{4}),$ plus some Gaussian noise with standard deviation $\frac{1}{7}$ . We'll split this into 90% training data and 10% test data.
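Here is a minimal sketch of how this dataset could be generated with NumPy. The sample spacing `dx`, the random seed, and the chronological (unshuffled) split are my assumptions; the text above only fixes the signal, the noise level, the number of points, and the 90/10 ratio.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

n_points = 5000
dx = 0.1  # assumed spacing between samples; not specified in the text
x = np.arange(n_points) * dx

# The three-term signal from the text.
clean = (
    np.sin(x)
    + np.cos(11 / 13 * x) / 5
    + np.sin(17 / 37 * x - np.pi / 4) / 9
)
# Gaussian noise with standard deviation 1/7.
noisy = clean + rng.normal(0.0, 1 / 7, size=n_points)

# 90% training data, 10% test data (assumed to be split in time order).
n_train = n_points * 9 // 10
x_train, y_train = x[:n_train], noisy[:n_train]
x_test, y_test = x[n_train:], noisy[n_train:]
```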

Our transformer will be decoder-only, with an input size of 25 data points, a hidden dimension of 8, a feedforward dimension of 4, 1 attention head, and 1 layer. We'll train for 1 epoch with a batch size of 128, using the Adam optimizer with a learning rate of 0.1. Writing that out feels ridiculous for a simple time series task like extracting a signal, but here we are.

Wow... the FFT did ok, but the transformer is absolutely horrendous! Let's try decreasing the learning rate to 0.01?

Getting a little better. Let's try training for 20 epochs instead of 1.

The transformer's predicted frequency and amplitude are a little too low, and it's not as smooth as the FFT solution, but now at least it's got the spirit. As for quantitative performance, the FFT had a root mean square error of 0.24, while the transformer had 0.88. The FFT is doing much better, especially considering that due to our 1/7 random noise, we wouldn't expect to get below 0.14.

Of course, the FFT also wins on training time. The FFT took just 0.11 seconds to compute on all training data, while the transformer took 18.9 seconds to train all 20 epochs. All that for worse performance.

The FFT has another benefit too: interpretability. We can look inside and see the amplitudes of all the frequencies that the FFT picked up.

If we look back to the real equation, this is pretty spot-on.
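For readers who want to try this kind of inspection themselves, here is a minimal sketch using NumPy's real FFT. It is not necessarily the exact code behind the demo: the sample spacing `dx`, the seed, and the amplitude scaling are my assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
dx = 0.1                        # assumed sample spacing
x = np.arange(4500) * dx        # 4500 points, matching a 90% training split
y = (
    np.sin(x)
    + np.cos(11 / 13 * x) / 5
    + np.sin(17 / 37 * x - np.pi / 4) / 9
    + rng.normal(0.0, 1 / 7, size=x.size)
)

spectrum = np.fft.rfft(y)
# Convert bin frequencies (cycles per unit x) to angular frequencies.
omega = 2 * np.pi * np.fft.rfftfreq(x.size, d=dx)
# Approximate sinusoid amplitude per bin; leakage spreads it over neighbors.
amplitude = 2 * np.abs(spectrum) / x.size

# Print the five strongest bins.
for k in np.argsort(amplitude)[-5:][::-1]:
    print(f"omega = {omega[k]:.3f}, amplitude = {amplitude[k]:.3f}")

peak_omega = omega[np.argmax(amplitude)]
```

With this spacing, the strongest bin should sit close to an angular frequency of 1, the dominant term in the signal. The peak amplitude comes out somewhat below 1 because the window does not contain a whole number of periods, so some energy leaks into neighboring bins.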

In the real world you might not have a dataset like this where a traditional non-machine learning model is obviously the perfect choice rather than machine learning or a transformer. Machine learning shines where we have the data, but we don't even know where to start when it comes to coding an algorithm. And neural networks like transformers shine even more in cases of unstructured data where we don't have good ideas to compute our own features. But hopefully this toy example demonstrates the universal approximation theorem, as well as why it isn't always a good guide in practice. To match or exceed the FFT performance with a transformer, we would need a lot more data and compute. Just because transformers can do everything doesn't mean they should.

Transformers aren't the final frontier either. There are exciting architectures like Mamba on the horizon that could one day replace transformers.[13] Most likely, they will still not replace traditional methods for similar reasons to transformers.

Bad Data means Bad Algorithm

Just as algorithmic constraints are sometimes compute constraints in disguise, we can also look at some algorithmic constraints as being data constraints in disguise.

What I mean is this: if we want to solve a problem with transformers, or any sort of machine learning, we usually start with a dataset. But no matter how good we get at inference on that dataset, and however cool the machine learning algorithm we end up with, we may not have actually solved the problem we had to begin with. In other words, winning a Kaggle competition for cancer detection does not mean you cured cancer.

Going with the Kaggle example, why is this the case? The answer, of course, is that detecting cancer is just one small part of curing cancer. Because of the targets and labels of our dataset, our machine learning algorithm will always be an algorithm designed only to detect cancer, no matter how much we refine it with more data.

This is a problem not just with transformers but with all of machine learning. There's just one exception: it would not apply if we were to train a general machine learning algorithm that can take any input and produce an appropriate output. Such an algorithm would be able to detect cancer and cure it.

The question is whether this applies to LLMs. Is predicting the next word really a proxy for intelligence and creative thinking, even up to the level of curing cancer? We've already seen that the simple next-word-prediction task seems to capture some amount of intelligence. But it remains to be seen how far this will go.

The internet does contain some data that requires reasoning: you might see a sentence like “Socrates is a man. All men are mortal. Therefore...” and have the LLM trained to continue. Human fine-tuning can refine this capability further than basic syllogisms. But maybe solving advanced reasoning through next-word-prediction will turn out to be intractable, and something more is required.

Conclusion

According to the universal approximation theorem, neural networks can approximate any continuous function to any degree of accuracy. This means that in theory, yes, transformers can solve everything in time series. They might even work in other areas, like images.

However, they often come at a large computational cost, and might require more data than we have access to. Even in cases where a transformer could work, a traditional model often comes with both better performance and lower cost.

Transformers are still a powerful model. While they are mostly associated with LLMs, startups could do well to build more domain-specific transformers too.

In general, given enough data, transformers or other neural networks will eventually do a good job matching their dataset. But when training a transformer, or any machine learning model, sometimes good performance on the dataset isn't really a success. You may have to ask whether solving this dataset is really solving a problem for people. Just like solving a breast cancer dataset on Kaggle won't end breast cancer, it remains to be seen whether solving the next-word-prediction task will solve AGI. There are arguments both for and against the idea that it will.

Even if next-word-prediction doesn't solve AGI, it's at least a useful tool for coding. When combined with other systems, transformers can solve tasks more robustly, even mathematical proofs that require perfect rigor. But the use of a more traditional system is critical here.

So yes, transformers can solve everything. But they probably shouldn't.


References

The GitHub for the transformers vs FFT demo is at crackalamoo/blog-demos.

[1] Attention Is All You Need (Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser & Illia Polosukhin, 2017)
[2] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit & Neil Houlsby, Google Brain, 2020)
[3] Decision Transformer: Reinforcement Learning via Sequence Modeling (Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas & Igor Mordatch, 2021)
[4] Scaling transformer neural networks for skillful and reliable medium-range weather forecasting (Tung Nguyen, Rohan Shah, Hritik Bansal, Troy Arcomano, Sandeep Madireddy, Romit Maulik, Veerabhadra Kotamarthi, Ian Foster & Aditya Grover, 2023)
[5] Lessons from 2 Million Machine Learning Models on Kaggle (Vasyl Harasymiv, KDnuggets, 2015)
[6] A visual proof that neural nets can compute any function (Michael A. Nielsen, Neural Networks and Deep Learning, 2019)
[7] Multilayer feedforward networks are universal approximators (Kurt Hornik, Maxwell Stinchcombe & Halbert White, Neural Networks, 1989)
[8] The Church-Turing Thesis (Stanford Encyclopedia of Philosophy, 1997–2023)
[9] Are Transformers universal approximators of sequence-to-sequence functions? (Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi & Sanjiv Kumar, 2020)
[10] Can AI Scaling Continue Through 2030? (Jaime Sevilla, Tamay Besiroglu, Ben Cottier, Josh You, Edu Roldán, Pablo Villalobos & Ege Erdil, Epoch AI, 2024)
[11] Transformers Can Do Arithmetic with the Right Embeddings (Sean McLeish, Arpit Bansal, Alex Stein, Neel Jain, John Kirchenbauer, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Jonas Geiping, Avi Schwarzschild & Tom Goldstein, 2024)
[12] Implementing an SHA transformer by hand (Andrew Gritsevskiy)
[13] Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Albert Gu & Tri Dao, 2023)

Connecting Covariance and Rotational Inertia

2024-09-04T16:38:07Z

The covariance matrix in statistics represents the covariances between a set of random variables. It has a number of interesting properties, such as:

It is symmetric (equal to its transpose).
Elements along its major diagonal are variances. The variance of a random variable is, loosely speaking, the average of a squared quantity.
Its other elements are covariances between two different variables, which are, loosely speaking, an average product of two variables centered on their expectations.

The inertia tensor in classical mechanics represents the “rotational mass”, or the difficulty of rotating an object around an axis. It also has a number of interesting properties, such as:

It is symmetric (equal to its transpose).
Elements along its major diagonal are, loosely speaking, the average of a squared quantity. Namely, squared distances from an axis weighted by mass.
Its other elements are, loosely speaking, an average product of two variables. Namely, the negative product of two coordinates, weighted by mass.

🤔

Is there a connection here? Or is this similar form just a coincidence?

There are many ways to ask this question. Today, I'll put it this way: is the inertia tensor also the covariance matrix of some set of random variables? And if so, what are those random variables?

Definitions

Let's start by defining the covariance matrix between $n$ random variables, which we will label $Y_{1}, \dots, Y_{n}$ . The elements of the covariance matrix are given by

$${C o v}_{i, j} = E [(Y_{i} - E [Y_{i}]) (Y_{j} - E [Y_{j}])] = C o v [Y_{i}, Y_{j}],$$

where $E$ represents the expected value of a random variable and $C o v$ represents the covariance of two random variables. In the special case along the major diagonal, this definition becomes ${C o v}_{i, i} = E [(Y_{i} - E [Y_{i}])^{2}] = V a r [Y_{i}],$ where $V a r$ represents the variance of a random variable.
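As a small illustration of this definition, here is a sketch that builds a covariance matrix directly from the formula, using sample averages in place of expectations. The toy data and variable names are assumptions of mine, chosen only so there is something correlated to look at.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

# Toy data: rows are observations, columns are the variables Y_1, Y_2, Y_3.
samples = rng.normal(size=(1000, 3))
samples[:, 1] += 0.5 * samples[:, 0]  # make Y_2 correlate with Y_1

centered = samples - samples.mean(axis=0)   # Y_i - E[Y_i]
cov = centered.T @ centered / len(samples)  # average of the centered products

variances = np.diag(cov)  # the major diagonal holds Var[Y_i]
```

The result is symmetric by construction, since the product of two centered variables does not depend on their order.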

Now let's look at the definition of the inertia tensor $I$ of an object around a given point. Intuitively, the inertia tensor represents a kind of “rotational mass”: how difficult it is to rotate something. For example, $I_{y y}$ (also called $I_{22}$ ) measures the difficulty of rotating an object about the $y$ -axis. The off-diagonal elements have a related meaning: for example $I_{x y}$ (also called $I_{12}$ ) represents the fact that when you attempt to rotate an object around the $x$ -axis, it may acquire angular momentum around the $y$ -axis as well, depending on its shape.[1] As you might expect, the inertia tensor is proportional to mass: heavier things are harder to rotate.

The elements of the inertia tensor of a continuous solid object are given by $I_{i, j} = \int_{V} (δ_{i j} | r |^{2} - x_{i} x_{j}) \, d m,$ where

The indices $i, j$ represent the $x, y, z$ coordinates for $i, j = 1, 2, 3$ respectively.
The integral here is over the entire volume $V$ of the object.
$d m = ρ (x, y, z) d x d y d z$ is an infinitesimal mass element depending on the density distribution $ρ$ of the object.
$r$ is the vector from the point around which we are computing $I$ to the current integration point, so $| r |^{2}$ is the squared distance from the center point of rotation to the current point within the object.
$δ_{i j}$ is the Kronecker delta, indicating that we only include the $| r |^{2}$ term for elements along the major diagonal, where $i = j$ .

Since we are integrating over the volume of the object, this is actually a triple integral. This is reflected in the definition of $d m$ .

We can get rid of the Kronecker delta by expressing the on-diagonal and off-diagonal elements separately. Noting that $| r |^{2} = x^{2} + y^{2} + z^{2} = x_{1}^{2} + x_{2}^{2} + x_{3}^{2}$ , we find

$$I_{i, i} = \int_{V} (x_{1}^{2} + x_{2}^{2} + x_{3}^{2} - x_{i}^{2}) \, d m, \qquad I_{i, j} = \int_{V} - x_{i} x_{j} \, d m \quad (i \neq j) .$$

We have a bit of a fundamental problem here: the on-diagonal and off-diagonal elements of the inertia tensor are defined differently, which is not the case for the covariance matrix. However, at least in some cases, we'll be able to get around this.
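To get a feel for the definition, here is a minimal sketch that evaluates it for a collection of point masses, where the integral over $d m$ becomes a sum over masses. Treating the object as point masses is my simplification, and the function name `inertia_tensor` is hypothetical.

```python
import numpy as np


def inertia_tensor(points, masses):
    """Inertia tensor about the origin for point masses.

    points: array of shape (N, 3); masses: array of shape (N,).
    """
    points = np.asarray(points, dtype=float)
    masses = np.asarray(masses, dtype=float)
    r2 = np.sum(points**2, axis=1)               # |r|^2 for each point
    delta_term = np.sum(masses * r2) * np.eye(3)  # delta_ij |r|^2 contribution
    product_term = np.einsum("n,ni,nj->ij", masses, points, points)  # x_i x_j
    return delta_term - product_term


# Two unit masses on the x-axis: easy to spin about x, harder about y and z.
I_dumbbell = inertia_tensor([[1, 0, 0], [-1, 0, 0]], [1, 1])
```

For this dumbbell, the moment about the $x$ -axis is zero because both masses sit on that axis, while the moments about $y$ and $z$ are each $2$ .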

Random Sampling by Mass

In order to compare the inertia tensor and the covariance matrix, it will help to reframe the inertia tensor in terms of random variables. Currently, we are integrating over the volume of an object with respect to a mass element $d m = ρ (x_{1}, x_{2}, x_{3}) d x_{1} d x_{2} d x_{3}$ . This is strongly reminiscent of the expected value of a random variable.

Let's imagine a random process in which we randomly sample a point from an object weighted by density, so denser regions are more likely to get picked. For uniform density, this becomes uniformly sampling from an object. Then our probability density function is a literal density function $ρ$ divided by the total mass $M$ of the object! After sampling a point, we get three random variables for the coordinates $X_{1}, X_{2}, X_{3}$ of that point.

Now, we can compute expected values in terms of these random variables. Consider, for example, the variance of $X_{2} + X_{3}$ . Provided $X_{2}$ and $X_{3}$ are uncorrelated (in general there is an extra $2 C o v [X_{2}, X_{3}]$ term, and we will come back to when it vanishes), this is equal to the variance of $X_{2}$ plus the variance of $X_{3}$ , which we can express in integral form as

$$V a r [X_{2} + X_{3}] = \int_{V} ((x_{2} - E [X_{2}])^{2} + (x_{3} - E [X_{3}])^{2}) \frac{ρ}{M} \, d x_{1} d x_{2} d x_{3} .$$

To simplify things, let's place the origin of our coordinate system at the center of mass of our object. Then our expected value for each coordinate is zero: $E [X_{1}] = E [X_{2}] = E [X_{3}] = 0$ . We now have $V a r [X_{2} + X_{3}] = \frac{1}{M} \int_{V} (x_{2}^{2} + x_{3}^{2}) \, d m,$ where $d m = ρ d x_{1} d x_{2} d x_{3}$ .

Recall that

$$I_{11} = \int_{V} (| r |^{2} - x_{1}^{2}) \, d m = \int_{V} (x_{1}^{2} + x_{2}^{2} + x_{3}^{2} - x_{1}^{2}) \, d m = \int_{V} (x_{2}^{2} + x_{3}^{2}) \, d m .$$

From the previous two lines, $V a r [X_{2} + X_{3}]$ is exactly equal to $I_{11}$ divided by $M$ ! By the same argument with the indices permuted, we also have $\frac{1}{M} I_{22} = V a r [X_{1} + X_{3}]$ and $\frac{1}{M} I_{33} = V a r [X_{1} + X_{2}]$ .
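As a quick sanity check, we can work this out for a shape where both sides are easy to compute. Take a uniform rectangular box with side lengths $a$ , $b$ , $c$ along $x_{1}$ , $x_{2}$ , $x_{3}$ , centered on the origin (the box is just an assumed example). Each coordinate is uniformly distributed on an interval, a uniform variable on an interval of length $b$ has variance $\frac{b^{2}}{12}$ , and for a box the coordinates are independent. The standard result for the box's moment of inertia about the $x_{1}$ -axis is $I_{11} = \frac{M (b^{2} + c^{2})}{12}$ . So

$$V a r [X_{2} + X_{3}] = V a r [X_{2}] + V a r [X_{3}] = \frac{b^{2}}{12} + \frac{c^{2}}{12} = \frac{b^{2} + c^{2}}{12} = \frac{1}{M} I_{11} .$$

The two sides agree, as expected.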

Since the major diagonals of the covariance matrix represent variances, we might be tempted to say $C o v (Y_{1}, Y_{2}, Y_{3}) = \frac{1}{M} I$ where $C o v (Y_{1}, Y_{2}, Y_{3})$ is the covariance matrix of $Y_{1}, Y_{2}, Y_{3}$ . Then if we define $Y_{1} = X_{2} + X_{3}$ , $Y_{2} = X_{1} + X_{3}$ , and $Y_{3} = X_{1} + X_{2}$ , the elements along the major diagonal are correct. Let's call these $Y_{1}, Y_{2}, Y_{3}$ our desired random variables, because they are physically interpretable in terms of our random coordinates $X_{1}, X_{2}, X_{3}$ and they also connect the covariance matrix to the inertia tensor. Physically, these represent all three distinct sums of two coordinates of a randomly sampled 3D point.

There's a problem though: if we define our random variables $Y_{1}, Y_{2}, Y_{3}$ in this way, are the off-diagonal elements of the covariance matrix also correct?

The Off-Diagonal Elements

With our desired random variables, the diagonal terms of the covariance matrix matched the inertia tensor! Now for the off-diagonal elements, we hope to find

$$C o v (Y_{1}, Y_{2}, Y_{3})_{i, j} = \frac{1}{M} I_{i, j} \implies E [(Y_{i} - E [Y_{i}]) (Y_{j} - E [Y_{j}])] = \frac{1}{M} \int_{V} - x_{i} x_{j} \, d m .$$

Since we chose to place our origin at the center of mass of the object, we have the expectation of all $X_{i}$ equal to $0$ . And since each $Y_{i}$ is a sum of $X_{i}$ terms, their expectations will also be zero. Then the above simplifies to $E [Y_{i} Y_{j}] = \frac{1}{M} \int_{V} - x_{i} x_{j} d m .$ Let's convert the integral on the right into an expectation.

$$\frac{1}{M} \int_{V} - x_{i} x_{j} \, d m = \int_{V} - x_{i} x_{j} \cdot \frac{ρ (x_{1}, x_{2}, x_{3})}{M} \, d x_{1} d x_{2} d x_{3} = E [- X_{i} X_{j}] .$$

So, with the definitions previously chosen for each $Y_{i}$ , we are forced to conclude $E [Y_{i} Y_{j}] = E [- X_{i} X_{j}]$ ! This is not true in general.

A cylinder, symmetric across the $x y$ , $x z$ , and $y z$ planes. Image source: Grendelkhan at the English Wikipedia, CC BY-SA 3.0.

There is a way around this. First, let's force $E [X_{i} X_{j}] = 0$ for $i \neq j$ . This isn't an unreasonable assumption. In fact, this assumption is true for sufficiently symmetric shapes.

Consider an object that is symmetric across the $x y$ plane. Then $E [X_{1} X_{3}] = E [X_{2} X_{3}] = 0$ because for every $(x, z)$ pair, there is an equally probable $(x, - z)$ pair; and for every $(y, z)$ pair, there is an equally probable $(y, - z)$ pair. We can make $E [X_{1} X_{2}] = 0$ by adding another coordinate plane of symmetry.

Our next modification will be to sample three independent points rather than just one. Then $Y_{1}$ is the sum of $y$ and $z$ coordinates of the first point, $Y_{2}$ is the sum of $x$ and $z$ coordinates for the second point, and so on. This keeps the variances of all $Y_{i}$ the same while making them all independent, so their covariances become zero.

However, for asymmetric shapes, the assumption of zeros off the diagonals is violated in general. So it looks like our desired random variables won't always work for asymmetric shapes.

So far, we have discovered:

Take an object that is symmetric across at least two of the $x y$ , $y z$ , and $x z$ planes.
Randomly, independently, and identically sample three points in the object, with sampling weighted by density.
Let $X_{1 i}, X_{2 i}, X_{3 i}$ be random variables representing the $x$ , $y$ , and $z$ coordinates of each randomly sampled point number $i$ .
Define new random variables $Y_{1} = X_{21} + X_{31}$ , $Y_{2} = X_{12} + X_{32}$ , and $Y_{3} = X_{13} + X_{23}$ . That is, for the $i$ th point, let $Y_{i}$ be the sum of the two coordinates excluding the $i$ th.
The covariance matrix of $Y_{1}, Y_{2}, Y_{3}$ will be equal to $\frac{1}{M} I$ , where $I$ is the inertia tensor of the object around its center of mass. All off-diagonal elements of $I$ and the covariance matrix will be zero.
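The steps above can be checked numerically. Here is a minimal Monte Carlo sketch for a uniform rectangular box, which is symmetric across all three coordinate planes. The box dimensions, sample count, and seed are assumptions of mine, and the comparison uses the standard formula for the moments of inertia of a box.

```python
import numpy as np

rng = np.random.default_rng(0)     # arbitrary seed
sides = np.array([1.0, 2.0, 3.0])  # assumed box side lengths a, b, c
n = 200_000                        # samples per point


def sample_points(count):
    # Uniform density, so sampling by mass is uniform sampling over the volume.
    return rng.uniform(-sides / 2, sides / 2, size=(count, 3))


# Three independent points per trial.
p1, p2, p3 = sample_points(n), sample_points(n), sample_points(n)

Y = np.column_stack([
    p1[:, 1] + p1[:, 2],  # Y_1 = X_21 + X_31
    p2[:, 0] + p2[:, 2],  # Y_2 = X_12 + X_32
    p3[:, 0] + p3[:, 1],  # Y_3 = X_13 + X_23
])
cov_Y = np.cov(Y, rowvar=False)

# Inertia tensor of a uniform box about its center, divided by its mass.
a, b, c = sides
I_over_M = np.diag([b**2 + c**2, a**2 + c**2, a**2 + b**2]) / 12
```

Up to sampling noise, `cov_Y` should match `I_over_M`, including the zeros off the diagonal.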

Asymmetric Objects

These definitions for each $Y_{i}$ don't work for asymmetric objects. But can we find definitions that do work? That is, given an asymmetric object of mass $M$ and its inertia tensor $I$ , can we define random variables $Y_{1}, Y_{2}, Y_{3}$ such that the covariance matrix of $Y = (Y_{1}, Y_{2}, Y_{3})$ is equal to $\frac{1}{M} I$ ?

First of all, the cop-out, direct answer is yes. If you have a set of random variables, their means, and their covariance matrix, you can quite easily sample values for those random variables. Many computing packages have this implemented. So given an inertia tensor $I$ , we can divide it by $M$ and then use the result as a covariance matrix to sample random variables with any means we like.
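Here is what the cop-out looks like in practice, as a minimal sketch with NumPy. The object is a hypothetical cloud of point masses, and I chose a Gaussian distribution with zero means purely for convenience; any distribution with the right covariance would do.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

# Hypothetical asymmetric object: 50 point masses at random positions.
points = rng.normal(size=(50, 3))
masses = rng.uniform(0.5, 1.5, size=50)
M = masses.sum()

# Inertia tensor about the origin (sum over masses instead of an integral).
r2 = np.sum(points**2, axis=1)
I = np.sum(masses * r2) * np.eye(3) - np.einsum("n,ni,nj->ij", masses, points, points)

# Use I / M directly as a covariance matrix and sample from it.
draws = rng.multivariate_normal(mean=np.zeros(3), cov=I / M, size=100_000)
cov_draws = np.cov(draws, rowvar=False)
```

This works because the inertia tensor is symmetric and positive semi-definite, which is exactly what a covariance matrix needs to be. The sampled covariance `cov_draws` should be close to `I / M`, but the three sampled variables have no physical meaning.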

But this is no fun. What we really want is a physical interpretation: which three interpretable physical random variables have the inertia tensor (up to a scaling factor) as their covariance matrix?

Using Eigenvectors

There's a way we can interpret the inertia tensor of any shape, even asymmetric ones, as a covariance matrix. This is done by choosing a different coordinate system in which the inertia tensor is diagonal: that is, all off-diagonal elements are zero. The good news is that this is always possible, for any shape, even asymmetric ones![2]

How do we find such a coordinate system? By taking the eigenvectors of the inertia tensor. In a physical context, the eigenvectors of the inertia tensor are also called the principal axes of the object.

The three principal axes of a tennis racket. Note that the racket is not symmetric across the $\hat{e_{2}} \hat{e_{3}}$ plane! Image source: CMG Lee, Wikimedia, CC BY-SA 4.0.

If we take the inertia tensor around the center of mass of the object, using the principal axes as our three coordinates, we will get a tensor $I$ whose off-diagonal elements are zero. Then we can find three independent random variables whose covariance matrix is $\frac{1}{M} I$ . We can even use the independent random variables from our symmetric object procedure, because the covariances were already zero and the variances are still fine.
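To see the diagonalization in action, here is a minimal sketch using `np.linalg.eigh`, which is designed for symmetric matrices. The four point masses are a made-up asymmetric object, and the helper `inertia_tensor` is a hypothetical name for the point-mass version of the definition.

```python
import numpy as np

# Hypothetical asymmetric object: four unequal point masses.
points = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.5],
    [-1.0, 1.0, 3.0],
    [0.5, -1.0, 1.0],
])
masses = np.array([1.0, 2.0, 3.0, 1.5])

# Put the origin at the center of mass.
points = points - np.average(points, axis=0, weights=masses)


def inertia_tensor(pts, m):
    r2 = np.sum(pts**2, axis=1)
    return np.sum(m * r2) * np.eye(3) - np.einsum("n,ni,nj->ij", m, pts, pts)


I = inertia_tensor(points, masses)
# Eigenvalues are the principal moments; columns of `axes` are the principal axes.
moments, axes = np.linalg.eigh(I)

# Re-express every point in the principal-axis coordinate system.
rotated = points @ axes
I_principal = inertia_tensor(rotated, masses)
```

In the original coordinates `I` has nonzero off-diagonal elements, while `I_principal` is diagonal with the principal moments on its diagonal.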

Now we've found a solution that works for any shape! That is, given any object, we can choose a coordinate system such that the inertia tensor of the object divided by its mass is equal to the covariance matrix of three random variables. In particular, these three random variables should be easy to obtain by randomly sampling three points within the object by density, and using the coordinates of those randomly sampled points.

To make things fully clear, let's go through a process by which we can make the inertia tensor of any object proportional to the covariance matrix of three easy-to-understand random variables.

Take any object.
Set coordinates $x_{1}, x_{2}, x_{3}$ to align with the principal axes of the object. You can find these by taking the eigenvectors of the object's inertia tensor in any coordinate system. Center the coordinate system on the object's center of mass.
Randomly, independently, and identically sample three points in the object, with sampling weighted by density.
Let $X_{1 i}, X_{2 i}, X_{3 i}$ be random variables representing the $x_{1}$ , $x_{2}$ , and $x_{3}$ coordinates of each randomly sampled point number $i$ .
Define new random variables $Y_{1} = X_{21} + X_{31}$ , $Y_{2} = X_{12} + X_{32}$ , and $Y_{3} = X_{13} + X_{23}$ . That is, for the $i$ th point, let $Y_{i}$ be the sum of the two coordinates excluding the $i$ th.
The covariance matrix of $Y_{1}, Y_{2}, Y_{3}$ will be equal to $\frac{1}{M} I$ , where $I$ is the inertia tensor of the object around its center of mass. All off-diagonal elements of $I$ and the covariance matrix will be zero.
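The whole procedure can be sketched in a few lines of NumPy. As before, this is only an illustration under assumptions of mine: the object is a random cloud of point masses (so it has no particular symmetry), sampling weighted by density becomes sampling points with probability proportional to their mass, and the seed and sample count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

# Step 1: take any object (here, 40 hypothetical point masses).
points = rng.normal(size=(40, 3)) * np.array([1.0, 0.7, 0.4])
masses = rng.uniform(0.5, 2.0, size=40)
M = masses.sum()


def inertia_tensor(pts, m):
    r2 = np.sum(pts**2, axis=1)
    return np.sum(m * r2) * np.eye(3) - np.einsum("n,ni,nj->ij", m, pts, pts)


# Step 2: center on the center of mass and align with the principal axes.
points = points - np.average(points, axis=0, weights=masses)
moments, axes = np.linalg.eigh(inertia_tensor(points, masses))
coords = points @ axes

# Step 3: sample three independent points per trial, weighted by mass.
n = 200_000
idx = rng.choice(len(coords), size=(n, 3), p=masses / M)
p1, p2, p3 = coords[idx[:, 0]], coords[idx[:, 1]], coords[idx[:, 2]]

# Steps 4 and 5: for the i-th point, sum the two coordinates excluding the i-th.
Y = np.column_stack([
    p1[:, 1] + p1[:, 2],
    p2[:, 0] + p2[:, 2],
    p3[:, 0] + p3[:, 1],
])

# Step 6: compare the covariance matrix of Y with I / M.
cov_Y = np.cov(Y, rowvar=False)
I_over_M = np.diag(moments) / M
```

Up to sampling noise, `cov_Y` should agree with `I_over_M`, even though the object itself is not symmetric.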

Although this is a lot of mathematical steps, there is nothing particularly convoluted here: all the steps, such as choosing an appropriate coordinate system for the object, sampling random points, and using sums of coordinates to generate random variables, feel fairly natural. So we've finally found a covariance matrix equal to $\frac{1}{M} I$ for any object!

Decorrelation Transform

Although this technically works for any object, it's a shame we had to force the off-diagonal elements to be zero. In general, covariance matrices don't have all zeros off the diagonals. But there's good news: just as we used principal axes to transform the inertia tensor and remove its off-diagonal elements, we can use a decorrelation transform to remove the off-diagonal terms of the covariance matrix.

And there's even better news: the decorrelation transform consists of changing into the basis of the eigenvectors of the covariance matrix, just as we found a new coordinate system using the eigenvectors (principal axes) of the inertia tensor.[3] And the reason we know we can use eigenvectors to remove off-diagonal terms (diagonalize the matrix) is that the inertia tensor and covariance matrix are both symmetric.[4] This goes back to one of the similarities we observed at the beginning!
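Here is a minimal sketch of a decorrelation transform with NumPy, to show how closely it mirrors the principal-axes calculation. The toy data and the mixing matrix used to correlate it are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

# Toy correlated data: rows are observations, columns are variables.
mixing = np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.5, 0.0],
    [-0.3, 0.4, 0.6],
])
data = rng.normal(size=(5000, 3)) @ mixing.T

cov = np.cov(data, rowvar=False)
# The covariance matrix is symmetric, so eigh applies.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Change into the basis of the eigenvectors.
centered = data - data.mean(axis=0)
decorrelated = centered @ eigenvectors
cov_decorrelated = np.cov(decorrelated, rowvar=False)
```

After the change of basis, `cov_decorrelated` is diagonal, with the eigenvalues of the original covariance matrix as the new variances.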

So while both inertia tensors and covariance matrices have nonzero elements off the diagonals in general, we can apply analogous transformations to both and remove those elements. Then we can find that with our desired random variables, the two are equal! (Although we might have to scale the random variables.)

Conclusion

Although they come from different fields, the inertia tensor and the covariance matrix have some superficially similar properties. Is there an interpretation in which the two are essentially the same?

We tried one interpretation by randomly sampling a point within an object, and using the coordinates of the random point to generate random variables. The only restriction on our coordinate system was that we set the origin at the object's center of mass. But we found that the covariance matrix of these random variables had different values from the inertia tensor at the off-diagonal elements, so this was unreliable for objects that are asymmetric across the coordinate planes. We were forced to set the off-diagonal elements to zero (as with a symmetric object) and sample three independent points in order for the equality to hold.

Then we found a solution that works for any shape. By choosing the principal axes of the object as our coordinate system, we get all the off-diagonal elements equal to zero, which matches the working symmetric case. If we also set the object's center of mass as our origin, the same random sampling procedure as the symmetric case gives us a covariance matrix equal to $\frac{1}{M} I$ , where $M$ is the object's mass and $I$ is its inertia tensor!

Just as we set the off-diagonal elements of the inertia tensor to zero using principal axes, we can do the same to a covariance matrix using eigenvectors. In fact, the principal axes are the eigenvectors of the inertia tensor, so these two transformations are intimately related. In the covariance case, this transformation is known as the decorrelation transform.

Even if we choose the wrong object or the wrong coordinate system, there is still a cop-out way to get the inertia tensor to match the covariance matrix. We can choose any means we like for our three random variables, and enforce the covariance matrix to be $\frac{1}{M} I$ . Although these random variables aren't physically interpretable, it technically fits the bill of three random variables with a covariance matrix equal to $\frac{1}{M} I$ for any object!

Although the connection between the inertia tensor and the covariance matrix might seem suggestive at first, it actually required a lot of steps to get there: using physics, statistics, and even linear algebra. In the end, we found a connection that works for any shape, showing the power of picking the right coordinate system and transforming mathematical objects in the right way. I hope you enjoyed discovering this interesting curiosity of classical mechanics and statistics!


References

[1] What's the physical significance of the off-diagonal element in the matrix of moment of inertia (safkan, Physics Stack Exchange, 2015)
[2] 3D Rigid Body Dynamics: The Inertia Tensor (J. Peraire, S. Widnall, MIT OCW, 2008)
[3] Decorrelating and then Whitening data (Rosalind W. Picard, MIT, 2010)
[4] Diagonalizing Symmetric Matrices (UC Davis)

Walking Across Rhode Island

2023-10-01T16:38:07Z

Notes:

This walk was for a fundraiser for malaria prevention and medication in some of the world's poorest areas. If you're interested in supporting the fundraiser, it would really make a difference and help save lives!
This post was edited with PageThinker, an AI writing assistant.

On October 8, 2023 at 7:11 in the morning, I took this picture, naïvely excited for the journey ahead.

The empty roads of Sterling, CT, on a bright and innocent fall morning

“On October 9, 2022, I walked across the entire state of Rhode Island,” said Charles Alaimo in his essay on the subject. “I’m going again this year and I want to convince you to join me.” And convince me he did: I woke up early and dragged myself out of bed to Brown's Main Green to catch the school bus to Connecticut. I was ready to walk across Rhode Island. After all, I had already walked a half marathon at the University of Florida. How hard could 29 miles be?

I knew it would be painful. I imagined myself reaching Providence with just a few miles to Massachusetts, horribly tired, but pushing through. I imagined telling myself, “you're so close. Don't give up now. Just push through these last few miles and you'll be done.” Then I imagined myself, feet aching and blistered, ultimately reaching the finish line and celebrating with everybody else who chose this adventure.

That was what I imagined.

The reality was quite different. In fact, I found myself with twelve miles left to walk, already limping and in pain. Walking across Rhode Island is already a forbidding task: limping across Rhode Island is really unthinkable. And I couldn't even tell myself “you're so close” because by any measure, twelve miles to walk is not “so close.”

But I didn't know any of that was coming. I just got off the bus and started walking.

Walking through the forest and stopping by a pond

A small wooden cottage barely visible through the trees

We started off in an unspoiled New England forest, probably kicking some pebbles for the first time in human history. The landscape pretty quickly turned into rural America, scattered with farms and small private properties. Some of these properties were little more than small cottages in the woods, reminiscent of Henry David Thoreau's cabin at Walden Pond.

A goat behind a fence

A notice put up by someone who really didn't want trespassing

A large reservoir we passed by on our walk in Scituate, RI

A few people stopped in their cars, absolutely perplexed as to what a large group of young people might be doing walking in such remote areas. We told them: “we are walking across Rhode Island.”

This usually didn't clear up their confusion. “The entire state?”

“Yes,” we said. “The entire state.” We also shared our fundraiser with them. One man in particular said he didn't have any cash at the moment, but he promised to send some money when he got the chance, and congratulated us on undertaking such a bizarre but impressive endeavor.

As time went on, my feet started to hurt more and more. I relished every time we stopped for a snack break, and dreaded every time we got up and continued walking. Our group split in two as the more experienced walkers went ahead, and I struggled to keep up. I was consistently shocked by the determination of the other group: it was almost 2:00 in the afternoon, and they still hadn't stopped for lunch or even taken a break in recent memory. Even when we tried to run, the distance between us remained constant. As a physics major, I couldn't wrap my head around the physics behind that.

My resolve began to weaken, so I convinced myself (and those around me) that if we could stop for lunch, sit down and relax for just a half hour, I would be able to regain my strength and walk the 15 or so miles to Massachusetts. After what seemed like five hours, the group ahead finally stopped for lunch in a small plaza at the western edge of Cranston, RI. We took every little shortcut we could find, probably cutting about 10 steps off our journey, and finally made it.

When I sat down in the Subway, I think I actually breathed an audible sigh of relief. After resting for a while, I got up, expecting to feel rejuvenated.

In fact, that's when the limping started. Somehow, sitting down had actually made my pain worse, if that was even possible. At that point, walking to the bathroom was a struggle, so the thought of walking the remaining 12 miles across Rhode Island was unbearable. That's when I was really ready to quit. I texted my friends about my progress so far and I was tempted to tell them that I wasn't sure if I could do it, that I was considering just calling an Uber back to campus. But I held myself back because I knew that if I told them I was ready to quit, then in some crucial way it would become true.

So instead of convincing myself that resting would make me feel better, I convinced myself that walking would help me forget about the pain. And that was true, to an extent. Walking brought a different kind of pain than that of sitting and feeling your blood pump through your poor overworked feet.

It felt great when we finally reached Providence; that is, until we realized just how big the city is. It was a five mile walk to India Point Park near campus, and even if we made it that far, we still had to walk all the way to Massachusetts. This moment, above all others, stands out to me as one where I could not have made it without others by my side.

We were still behind, so we tried running, but to no avail. So we discussed Aesop's famous fable and decided to take the strategy of the tortoise. Somehow that allowed us to catch up to the rest of the group. Again, I'm not sure if the physics on that adds up.

Sunset while crossing the Providence River (or, as it felt at the time, the Rubicon)

As we neared Massachusetts, we crossed the marathon mark at 26.2 miles. Now, limping though I was, I could finally use the “so close” argument I had prepared for myself before setting out on this fool's errand. But the rest of the group seemed to march on steadily, while I felt completely defeated, even this close to the finish line. I asked myself the question: will I make it to Massachusetts? Barring some borderline impossible quantum fluctuations, I knew the question had a definite yes or no answer, one that was knowable in principle: I just wasn't sure which it was. I decided that statistically speaking, given everything I knew, the answer was most likely yes.

We crossed from Providence into East Providence, which borders Massachusetts. Soon we had just 15 minutes left to walk according to Google Maps. But I simply couldn't take it anymore. I did a short sprint to catch up to the leading group and begged everyone, almost literally on my knees, to take a break for just two minutes.

Surprisingly, it worked. We rested for exactly two minutes on the clock before continuing our walk. Those last 15 minutes were some of the most painful of my life. But for the first time in many miles, I knew for a fact that I wasn't going to quit.

When we finally reached Seekonk, MA, the vibe was incredible. It was around 7:45 in the evening, almost 13 hours after we started our trek. As each group of walkers crossed the state border, everyone cheered and we all knew that despite all the pain during and after the journey, it was so worth it. Best of all, the man who promised to donate to our fundraiser drove by and personally congratulated us. We had met him about 8 miles back, so he must have really loved our work.

When I started, I imagined myself having a similar experience to the one Charles described. While Charles mentioned being fatigued at India Point Park, I think what defined my experience was just how early I was ready to give up. It was precisely the moment in Subway when I got up and realized that far from being recovered, I was now limping. 17 miles in, with 12 to go, I felt completely hopeless in that state.

But I learned that when we have a clear goal, humans are capable of pushing ourselves far beyond what we thought possible. In the productivity sphere in recent years, there's been a shift from big and abstract goals to consistent marginal improvements. James Clear, who wrote the book Atomic Habits that exemplifies this trend, also wrote a post about a British bike team that always had the goal to win the Tour de France but never succeeded. It was only by using a system of marginal improvements, without changing their goal, that their new coach could lead them to victory.

The mathematician in me agrees — it's true that if you improve by 1% every day, you'll be almost 38 times better at the end of the year. But emotionally, I believe this is impossible without a cherished ideal looming over those 1% improvements, a reason to go through the trouble of putting one foot in front of the other 70,000 times. I previously wrote about this idea in the context of science, but it has never felt more real to me than on this walk. The marginal change was well-defined: a journey of 70,000 steps begins with just one. But what truly defined the journey was not the step count, or even the mile count, but the romantic ideal of walking across the entire state of Rhode Island. Even walking a marathon felt like a minor detail in comparison.

So to the British bike coach, I say that it's not just all the marginal gains that led to a victory, although they are certainly important. I would bet that the new coach also made his team believe in winning the race, not as a mere abstraction, but as an exciting ideal to strive toward. For me, it's not about optimizing my individual steps — now that I've walked across Rhode Island, my next goal is to see what crazy ridiculous thing I can do next. Something so crazy and ridiculous that I might just want to do it.

The border between Rhode Island and Massachusetts

A map of our route

BardGPT

2023-07-09T16:38:07Z

I asked ChatGPT to give me some Spanish words that end in the letter I. It gave me the following:

Amistad
Felicidad
Abril
Café
Bolígrafo
Lápiz
Universitari
Leviatán
Reír
Baladí

Of these, only universitari and baladí actually end in I. Furthermore, universitari isn't a Spanish word at all — it's Catalan. ChatGPT admittedly performed better with the same exercise in English, but this serves to highlight an important fact about large language models as they are currently built: they only have access to the words themselves, and have no idea what the words sound like or even how they're spelled beyond what is explicitly stated in their training data. How can we effectively add this information to language models?
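
For contrast, the same check is trivial for a program that can see the characters. Below is a minimal Python sketch that filters the list above; the helper names strip_accents and ends_in_i are made up for illustration, and treating í as equivalent to i is my own assumption about what "ends in I" should mean.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Remove combining accent marks so that 'í' compares equal to 'i'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def ends_in_i(word: str) -> bool:
    """True if the word's last letter is an i, with or without an accent."""
    return strip_accents(word.lower()).endswith("i")

words = [
    "Amistad", "Felicidad", "Abril", "Café", "Bolígrafo",
    "Lápiz", "Universitari", "Leviatán", "Reír", "Baladí",
]

matches = [w for w in words if ends_in_i(w)]
print(matches)  # ['Universitari', 'Baladí']
```

A language model working on tokens rather than characters never gets to run a check like this directly, which is the gap the rest of this post is about.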

William Shakespeare, also known as “the Bard”

This leads into an even more important point. Scaling language models is great, but it comes with an enormous computational and environmental cost. How can we add more information into language models and achieve better results while managing the computational power required? In other words, how do we design language models more intelligently?

That's why I decided to make BardGPT, a miniature GPT model for generating poetry. BardGPT is also an exercise in extending the power of language models by giving them access to information beyond syntax and semantics. It uses information about the phonology of words to generate poetry more accurately than a plain transformer model while increasing the number of parameters only marginally.

How it Works

Data

Data for this model was sourced entirely from Project Gutenberg. In retrospect, this was a little flawed just because of the limited quantity and diversity of sources there. I only reached about 12 MB of raw data, which I then had to further process and condense. In the future I would like to use a larger and more diverse set of sources, hopefully with web scraping. This would also allow a larger model, because computing power wasn't the limiting factor here: overfitting was.

With that said, I did try a larger dataset from Kaggle. I found that the resulting poems were actually far worse because the Kaggle dataset had a lot of low-quality data. After finishing this project, I published my own dataset on Kaggle.

Once I had the data, there was a significant amount of preprocessing to be done. Most importantly, I used special tokens for common suffixes such as =ing (run =ing -> running) and =ed (use =ed -> used). This allowed a larger effective vocabulary size so I wouldn't need to store two separate tokens for run and running when the two ideas are related in a predictable way. Another trick to manage the vocabulary size was to make everything lowercase.
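
To make the suffix-token idea concrete, here is a rough Python sketch of how tokens like =ing and =ed could be merged back into words after generation. The function names and the spelling rules are assumptions for illustration, not the code from the repository, and the rules are deliberately crude.

```python
VOWELS = set("aeiou")

def attach_suffix(stem: str, suffix: str) -> str:
    """Attach a suffix such as 'ing' or 'ed' using rough English spelling rules."""
    starts_with_vowel = suffix[0] in VOWELS
    # Drop a silent final 'e' before a vowel-initial suffix: use + ed -> used
    if (starts_with_vowel and len(stem) > 2
            and stem.endswith("e") and not stem.endswith("ee")):
        return stem[:-1] + suffix
    # Double the last letter of three-letter consonant-vowel-consonant stems:
    # run + ing -> running
    if (starts_with_vowel and len(stem) == 3
            and stem[0] not in VOWELS and stem[1] in VOWELS
            and stem[2] not in VOWELS and stem[2] not in "wxy"):
        return stem + stem[-1] + suffix
    return stem + suffix

def merge_suffix_tokens(tokens: list[str]) -> list[str]:
    """Fold '=suffix' tokens into the word that precedes them."""
    words: list[str] = []
    for tok in tokens:
        if tok.startswith("=") and len(tok) > 1 and words:
            words[-1] = attach_suffix(words[-1], tok[1:])
        else:
            words.append(tok)
    return words

print(merge_suffix_tokens(["she", "use", "=ed", "to", "love", "run", "=ing"]))
# ['she', 'used', 'to', 'love', 'running']
```

These rules miss plenty of cases (longer stems like "stop", irregular verbs), so a real pipeline would need either more rules or a lookup table.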

I also developed my own functions to determine rhymes and syllable counts. Initially I tried the eng-to-ipa Python module, but this was far too slow and didn't work for less common words. My functions sacrificed a small amount of accuracy for far greater speed and domain specificity. Check the GitHub repository crackalamoo/bardgpt for more details on how I stored information on rhyme and “meter”, loosely speaking.
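
As an illustration of the speed-for-accuracy tradeoff, a spelling-based syllable counter can be written in a few lines. The sketch below is a generic vowel-group heuristic, not the function used in BardGPT, and the names count_syllables and line_syllables are hypothetical.

```python
import re

def count_syllables(word: str) -> int:
    """Estimate syllables by counting vowel groups, with a silent-e adjustment."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # A final 'e' is usually silent ("time"), but not in words like "table"
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def line_syllables(line: str) -> int:
    """Sum the estimated syllable counts of every word in a line."""
    return sum(count_syllables(w) for w in re.findall(r"[a-z']+", line.lower()))

print(count_syllables("passion"))         # 2
print(line_syllables("the frozen pang"))  # 4
```

A heuristic like this gets some words wrong (it counts "radiant" as two syllables), but it never has to look anything up, so it runs quickly over a whole corpus.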

Finally, I specified some tokens that the model should not be allowed to predict, even though they are among the most popular tokens in the dataset. This mainly included proper nouns such as “Lenore” (from Edgar Allan Poe's famous poem “The Raven”). Generally I kept only those proper nouns that have great significance outside of the literary work they are associated with, so Lenore was out but Jesus made the cut. Some cases were more tricky, like Achilles from Homer's Iliad.
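
One common way to forbid tokens at generation time is to set their logits to negative infinity before the softmax, so they receive zero probability. The sketch below shows that idea in plain Python with a made-up three-word vocabulary; whether BardGPT applies its blocklist this way or somewhere else in the pipeline is an assumption on my part.

```python
import math

def mask_banned(logits: list[float], banned_ids: set[int]) -> list[float]:
    """Replace the logits of banned token ids with -inf."""
    return [-math.inf if i in banned_ids else x for i, x in enumerate(logits)]

def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax; assumes at least one logit is finite."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and scores, for illustration only
vocab = ["the", "lenore", "night"]
logits = [1.0, 3.0, 2.0]
banned = {vocab.index("lenore")}

probs = softmax(mask_banned(logits, banned))
best = vocab[probs.index(max(probs))]
print(best)  # night
```

Without the mask, "lenore" would have been the most likely token; with it, the remaining probability mass is renormalized over the allowed tokens.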

Model Architecture

A schematic of the BardGPT architecture, made with draw.io. Rounded rectangles indicate trainable operations, while sharp rectangles indicate fixed operations.

This schematic mostly covers the big picture of the BardGPT architecture. Check the GitHub repository crackalamoo/bardgpt for more technical details on the components of the model. There are just a few important points I would like to add here.

First, I tested three different architectures in this project. The simplest one was an n-gram model that tries to predict the next word given the last three words using only a set of linear layers. The next was a transformer model, which is based on a GPT architecture and resembles the schematic shown here but with the rhyme/meter sections cut out. The final model is the one shown here, which is a transformer model with additional phonological information.

It's also interesting to note the number of parameters for each model:

n-gram: 24.1 million
Transformer: 38.9 million
Bard: 39.4 million

The rhyme and meter layers of the bard model only add about 500,000 parameters, which is a small fraction of the total.

Results

༄༅༅ the frozen pang
o thou life! returning now to open suns, ascend
to find unseen she dwelt! thy gleam thy bright loom act girl
to ransom radiant order crown the righteous speed.
loss of many whose gifts fail! string thou further,
bring to spy my anguish ajax, king heir!
thy elder burial hand must heart abhor,
think me canst thou then time these cared forbear,
thine sister here posterity, she need,
and my loosened passion stoops in vain.
— BardGPT

Each model reached the following minimum perplexities:

n-gram: 119.18
Transformer: 83.23
Bard: 80.18

The bard model therefore had a 3.7% improvement over the transformer model, while adding only 1.3% more parameters. But if you look at the actual poem the model generated, it doesn't seem to rhyme very well. Has the bard model really learned anything new?
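
For readers who want to check the arithmetic, here is a short Python sketch. It uses the standard definition of perplexity (the exponential of the average negative log-likelihood per token), which I am assuming matches how the numbers above were computed, and then reproduces the two percentages from the reported figures.

```python
import math

def perplexity(token_log_probs: list[float]) -> float:
    """exp of the average negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Sanity check: a model that is uniformly unsure among 4 tokens has perplexity 4
uniform = perplexity([math.log(0.25)] * 3)

# Figures reported above
params = {"transformer": 38.9e6, "bard": 39.4e6}
ppl = {"transformer": 83.23, "bard": 80.18}

ppl_drop = (ppl["transformer"] - ppl["bard"]) / ppl["transformer"]
param_growth = (params["bard"] - params["transformer"]) / params["transformer"]

print(f"{ppl_drop:.1%} lower perplexity for {param_growth:.1%} more parameters")
# 3.7% lower perplexity for 1.3% more parameters
```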

I suspect the rhyme encoding I designed did not work well enough to encourage the model to rhyme. Meter was more straightforward: just by knowing how many syllables are in a line, the model can much better predict where to place <newline> tokens. To compare, here is a poem generated by the plain transformer model:

༄༅༅ haze footstep with a stray night mount went back long forward and down, one after plank and ten big star shone purple and austere,
sunk in a shining arm, prone his labor fled.
all gathered smoke came sympathy, by trenches devoting,
thrusting over all the age in door, roots went,
but unknown in happy rivulet, too soon began, that thou hast laughed re-indeed stirrest thy gates
that is the stream whose ruddiest song are heard?

Notice how some lines are much longer than others. (You may have to scroll.) This is much less of an issue in the bard model.

And just for fun, here's a poem generated by the naive n-gram model:

༄༅༅ we as fold debate reside between with helmet to your city kingly words
attend stood your wild island quarter
that shing twice while alone, earth flashed heaven slept thou truth when beauty's order bled?
crown over these run. but what, if the you knew."--say no more.
no more together given round us draw once. prow " if aloft thou pledging wherein finds alone, and flashes at thy loose touch along his songs with fill been filled rigid retreats one to fetch afar;
sun and since feel, like humble witness north, kneels: do they bring. she smiled
to asks stand cap and queen of golden paint along her glade should strike love their age; perhaps thee troops fell is looked to ilion;
accord of, i win
before the king that sad bill lost blazing up against vain pursue;
her spirit fire alone
to solemn cottage sing sound foam all cow, bank move of this soul design, the neptune most bow of wield close of sleep.
such heaven was jest, my flood, in death now the trade away, of sculptured scar swaying you are near
even remains sweetly learning, race from heaven kiss,
descend the conquering rainbow shone."
why moved;
nor passion! they their drunken met their charge ran upon what glass unknown before away, such fears which still rise away."

So clearly the transformer model was a vast improvement on the n-gram model, and it seems the bard model was an improvement on the transformer model as well. But what are the implications of this improvement?

Beyond Rhyming

Does a language model even need to know how to rhyme? I asked Llama-2-7B to generate a poem:

A sunbeam's gentle touch
A fleeting gift from above,
It lights the path I travel on,
And fills my heart with joy and love.
The light it brings is pure and bright,
A beacon in the darkest night,
It guides me on my journey through,
And helps me find my inner light anew.

With 7 billion parameters but no special architecture for rhyming, Llama-2-7B was able to generate a poem that rhymes better than BardGPT. This is because in addition to the increase in parameters, Llama-2-7B was trained on a much larger dataset from which it can simply memorize which rhymes are common in poetry. BardGPT, on the other hand, was trained on a much smaller dataset and had far fewer parameters, so it was not able to memorize rhymes to this extent.

7 billion parameters isn't even that excessive for a large language model, so do we really need specific architecture for rhyming? And to raise an even bigger question, do we need any additional model architecture at all, or can we rely on scaling models and discovering emergent properties?

At the end of the day, a large language model is just a mathematical algorithm. A more sophisticated algorithm can do more sophisticated tasks, in principle including coding, mathematics, rhyming, and anything else as long as it has the relevant data and parameters. So perhaps we don't really need specific architecture for rhyme and meter.

But to perform extremely sophisticated tasks, approaching artificial superintelligence, we need a model with extremely sophisticated data and parameters. This requires an immense amount of computational power and results in a black box model that is difficult to understand and control. Rather than going down this route, I think it's important to look at ways we can improve large language models without just adding more data and parameters. Transformers were a huge step forward in this regard, and reinforcement learning from human feedback was another. Surely there are many more to come.

This approach of making more sophisticated models has three main benefits over simply scaling the models:

If done well, it allows us to achieve better results with less computational power and less environmental impact.
With multimodal learning, it can expand the capabilities of our best AI models beyond language.
It improves the interpretability of the models, which is increasingly important as models develop human-like intelligence and may one day surpass us while we still have little understanding of how they work.

It might seem difficult, and it is, but consider this: humans are still smarter than AI, but the energy expenditure of a human brain is much less than that of a ChatGPT supercomputer. This means it can be done: we humans just need to use our brains and find out how to do it responsibly, before our brains themselves lose their spot at the top of the intelligence ladder.

And of course, we can't ignore the question of whether we should try to make such a powerful model at all. I don't have an answer to that, because its impacts would surely be far-reaching both positively and negatively. But I think if it can be done, it will be done, so it's better to do it in a thoughtful way.

Among the poets in the dataset for BardGPT were Lord Byron and Percy Bysshe Shelley. It's interesting that Lord Byron's daughter, Ada Lovelace, was a central figure in the early development of computer science and some even say she wrote the first computer program. Percy Shelley's wife, Mary Shelley, wrote Frankenstein, a cautionary tale whose message we would do well to heed in the AI race. As we develop more powerful AI, we have an opportunity to prevent an even worse environmental crisis and give humanity access to a benevolent intelligence at our fingertips. But it's possible that one day, we might have to get used to not being the smartest thing on the planet anymore. I hope and expect that day will be one where we live in prosperity. It should be a day where we do not look back on the arrogant humanity of the past like Percy Bysshe Shelley looked back on the arrogant pharaoh Ozymandias:

And on the pedestal, these words appear:
My name is Ozymandias, King of Kings;
Look on my Works, ye Mighty, and despair!
Nothing beside remains. Round the decay
Of that colossal Wreck, boundless and bare
The lone and level sands stretch far away.

Can Transformers Solve Everything? — Looking into the math and the data reveals that transformers are both overused and underused.
DiscordGPT — How to use data from Discord in a GPT model and talk to a simulation of you and your friends
Ortatious Andith ib Staylatt Neller — What does English sound like to people who don't speak it? And what does that have to do with machine learning and overfitting?

References

The GitHub repository for this project is at crackalamoo/bardgpt. This project also has a Hugging Face page.

On the Shoulders of Giants

2023-06-10T16:38:07Z

Note: this post was edited using AI.

Americans dream about taking double-decker buses to Big Ben, enjoying croissants and baguettes in the City of Love, and walking past the Trevi Fountain on the way to the Colosseum. This summer, I had the fortune of visiting London, Paris, Rome, and Sicily. Each proudly displayed its culture and history — and there was a lot of it to appreciate, as even the most storied cities in the United States don't have the thousands of years people have been living and building in these places. Between the British, French, and Roman Empires, these countries have left a profound impact everywhere from Romania and French Polynesia to New England and British Columbia. The work of scientists like Isaac Newton, Marie Curie, and Galileo Galilei continues to shape lives worldwide.

How just a few countries in Western Europe came to influence so much of the world is a subject that has been addressed by many historians. I am no historian, but as a student of science and technology, I wanted to write about how learning this history on my trip made me think about my role as a technical person in a highly uncertain 21st century. Witnessing a rich legacy of science, and of honoring scientists, was inspiring. Meanwhile, I also thought about mistakes these countries made in their technological and economic hubris at the height of their power.

Thinking Big

Museo/Mostra Leonardo da Vinci in Rome

While in Rome, I visited an interesting exhibit on Leonardo da Vinci. Leonardo is the archetypical Renaissance man. In addition to painting the Mona Lisa, probably the most famous work of art in the world, he experimented with anatomy and all kinds of flying machines that were far ahead of his time. This is exemplified by a famous quote which he probably never said:[1]

“Once you have tasted flight, you will forever walk the earth with your eyes turned skyward, for there you have been, and there you will always long to return.”

For me, Leonardo da Vinci is a model for thinking big. He had extensive knowledge across a wide range of subjects, and was eager to learn more and make his mark. Instead of settling for improving horse-drawn carriages, he wanted humans to thoroughly understand their own bodies and fly like birds for the first time. He was always curious, and he was in the right place for that: Florence was the unparalleled center of the Renaissance.

Despite his genius, history would not remember Leonardo in the same way if he had turned his attention to improving horse-drawn carriages. This is partly because better carriages are less interesting than flying machines or diving suits, but it is also because big and imaginative ideas capture the soul in a way incremental steps cannot. Smaller steps are no doubt important, but if we focused solely on improving planes, we would not have stumbled upon a moon landing so quickly. We needed a literal moonshot and a grand vision to convince the public that landing on the moon was possible (though very difficult) and worth doing.

In modern science, the pursuit of disruptive moonshot ideas has become less common. While the 20th century witnessed countless scientific revolutions that changed our view of the world, from general relativity to the double helix, the pace of progress has seemingly slowed despite the increase in funding. This is often attributed to a “low hanging fruit” scenario: thought experiments may have been enough to spark the development of special relativity, but even Einstein could not find a theory of everything by just thinking about it if he lived today. Modern day disruptive ideas demand substantial funding and large teams. The belief is that most of the less costly disruptive ideas have already been discovered.

However, Park et al. [2] challenge this “low hanging fruit” theory. They observe this decline across scientific fields and contend that it is unlikely for all fields to find their “low hanging fruit” at the same time. Instead, they claim that part of the reason for the decline is that researchers focus on very narrow fields, often citing older familiar works or even their own works rather than reading more broadly. This is an idea that Leonardo would approve of: his art and science informed each other, especially as his artistic hand enabled him to represent human anatomy more accurately, and his studies of anatomy in turn informed the realism in his artistic creations. Embracing the Renaissance man ideal can serve as a model to think broadly and find disruptive ideas, especially in a 21st century in need of such ideas.

Unfortunately, Leonardo is also an example of why thinking big is so hard. Many of his inventions simply were not feasible with the technology of the time. This meant that although he had great ideas, he often kept them to himself and his notebooks rather than implementing them in the real world, so they failed to influence others as much as they could have.

Statue of Archimedes in modern Syracuse

Archimedes was more successful in this regard. He is best known for his observation while taking a bath and watching the water rise. This supposedly led him to realize how buoyancy works and run naked through the streets of Syracuse, Sicily shouting “eureka!” (These are the same streets I saw when I stayed in Syracuse.) But this was far from his only big idea. Another famous work of his is The Sand Reckoner, in which he works with very large numbers in an attempt to estimate the number of grains of sand needed to fill the universe. His final estimate (8 × 10⁶³ [3]) was wildly inaccurate, but even thousands of years later we respect his bold calculations despite his numerical answer.

While many of Leonardo's thoughts were hidden in his notebooks, Archimedes let everyone know his big ideas. His thoughts were disseminated through the Mediterranean of antiquity and have influenced scientists and mathematicians for generations. Leonardo has also inspired many scientists and mathematicians, but often not until long after his death. One of his notebooks was written around 1508, but was not discovered until 1690.[4]

Today, however, both Leonardo and Archimedes are honored for their novel ways of thinking in science and have served as role models ever since. Paying homage to great minds from long ago is a fun way to feel connected to history, as I did in Rome and Syracuse, but it also has more direct and practical purposes that I hadn't thought much about before this trip. It is by standing on the shoulders of giants that warring visions of the world may claim legitimacy and pick up where previous generations left off.

Shoulders of Giants

One thing I was impressed by in England and France was the extent to which they honor their scientists. The single most famous landmark of Paris is the Eiffel Tower, which is mostly unadorned except for the names of famous French scientists painted in gold along each side. Lavoisier (conservation of mass), Ampère (relationship between electric current and magnetic fields), Coulomb (law of electric forces): all those familiar names from high school chemistry and physics are there.

The second most famous landmark of Paris is Notre-Dame, which is a cathedral and does not display the names of scientists. But Westminster Abbey in London actually does. Along with various kings and queens, the abbey contains the graves of such prominent names as Isaac Newton, Charles Darwin, and Stephen Hawking.

Westminster Abbey in London

It was a little surprising to see an atheist physicist and the agnostic father of evolution buried in the same church where the British monarch takes the throne and vows to serve God. But this shows the central role of science in the British national narrative. Science and technology have historically been one of the United Kingdom's chief exports to the world. It was this development in science that was a core piece of Britain's becoming the first industrialized nation and then the largest empire in history. In France, too, great French scientists were part of what made France a significant economic and imperial power.

These British and French scientists have always served as giants on whose shoulders others may stand, in their own countries and elsewhere. But not all who stood on those shoulders saw further. The massive British and French empires were also known for their exploitation of colonies, a feat made more efficient by their access to technology. They got better weapons, better transport, and better communications, all of which were vital in their expanding colonial empires. Steamships let them get to the colonies quickly, trains helped them get around the colonies, and telegraphs connected colonial administrations across vast distances.[5] What is it that led to such rapid technological progress in Europe?

Innovative Ideas

Galileo Galilei was a central figure of the Italian Renaissance, meaning he inherited the legacy of a period of relative stagnation in Europe. After the fall of Rome, Europe entered what some call the “Dark Ages”. But this was not the only legacy he inherited. He also inherited the legacy of Christian civilization in Europe, and was part of a movement reviving the legacy of Greco-Roman civilization. Greco-Roman civilization itself was formed when, after conquering Greece, Rome absorbed the elements of Greek civilization it found inspiring instead of rejecting them as altogether foreign. So Galileo inherited the legacies of at least three civilizations: Christianity, Rome, and Greece. Europe also borrowed from Islamic civilization through translations of famous Arabic works such as al-Khwarizmi's Al-Jabr.

Far from popping out of nowhere, the Renaissance had up to four prior intellectual traditions on which it could draw. What distinguished it from the Middle Ages is that the Middle Ages were dominated by just one intellectual tradition: Christianity. Not that Christianity was irrelevant to the Renaissance: many great Renaissance works such as David and The Last Supper take direct inspiration from the Bible. But the Renaissance also revived classical Greek and Roman ideas alongside the Christian ones, such as a reengagement with Hellenistic science, Socratic philosophy, and Cicero's literature alike. Even David is in the style of Greek sculpture. The Renaissance was also open to new Islamic ideas: Al-Jabr was the main mathematics textbook in European universities all the way up to the 16th century.[6] The combination of all these ideas from disparate sources led to an unprecedented rate of intellectual progress in Europe.

Galileo died in 1642. Shah Jahan completed the Taj Mahal mausoleum in 1648, when Newton was a small child. At the time, the Mughal Empire in India was reaching its zenith. Although it was an Islamic empire, the majority of its population was Hindu. Did the Mughal Empire embrace diverse intellectual traditions like its contemporary in Renaissance Europe?

Shah Jahan's grandfather, the Mughal emperor Akbar, was surprisingly open to diverse traditions. He encouraged dialogue between the various faiths of his empire, had Sanskrit classics translated into Persian, and incorporated local Indian traditions into his architecture. He was impressed by the realism and perspective developed in art samples from the European Renaissance brought by Jesuit missionaries, and incorporated these into Mughal art as well.[7]

However, science does not make the list of the great Mughal advances. As one of the gunpowder empires, the Mughals did make guns, but that does not rank as an inspiring and innovative development. Perhaps this is because Akbar was just one emperor, and was not educated in science, so it is understandable that there would not be a scientific revolution in his empire. Euclid's Elements was not translated into Sanskrit until 1719.[8] Another reason is that the Mughals did not draw on older traditions to the maximum extent possible: mathematicians in the Islamic Golden Age prior to the Mughals were already familiar with the Elements,[9] but the Mughals themselves were not.

When cultures and ideas are in deeper and more meaningful contact, scientific revolution becomes more likely. The scientific achievements of the Hellenistic age such as those of Archimedes and Euclid followed a period of interaction between Greek culture and Egyptian, Persian, and Mesopotamian cultures. This influenced India as well, leading almost immediately to the Mauryan empire and then to great scientific achievements like the mathematics of negative numbers and new ideas in astronomy under the Gupta empire. Both India and the Hellenistic age influenced the Islamic Golden Age, which in turn impacted the Renaissance along with Christian and revived Greco-Roman traditions. Finally, this led to the scientific revolution in Europe.

The Pantheon in Rome, one of the most famous Ancient Roman sites. Pantheon is an Ancient Greek word. The obelisk in front is Ancient Egyptian.

Is this an infallible rule? You could point to the Song Dynasty in China as an example of great scientific advancement (gunpowder, compass, woodblock printing) with relatively little external cultural interaction. But even China has a long history of competing philosophical schools, especially Confucianism, Taoism, Buddhism, and Legalism. So perhaps it is not cultural contact for its own sake that stimulates new science and creative thinking; rather, it is a diversity of ideas that tends to follow from the mingling of traditions. If everyone thinks the same way in all walks of life, they will be conformists in science, too.

Takeaways

As I grow in my career in science and technology, I want to keep an eye on my little place in history in addition to my technical skills. I aspire to leverage this understanding to help me pick meaningful issues to work on and contribute positively to the world. This means I have to read, experience, and most importantly reflect on history. From my Europe trip, here's what I learned from a scientific perspective:

From Leonardo da Vinci, I learned to think big. I aspire to be strong enough to risk failure, bold enough to try things that were previously impossible, and curious enough to constantly search for new avenues of thought.
From Archimedes, I learned to ground my big thinking. In order to have an impact, I need to not only think big but also effectively communicate my ideas. I know not all my ideas can succeed, so I should have humility in acknowledging when I am uncertain or just plain wrong.
From the Eiffel Tower and Westminster Abbey, I learned to stand on the shoulders of giants. Building on previous scientific achievements is a force strong enough to create massive industrialized empires. I am certainly not into imperial domination myself, but at least I can try to make cool stuff.
From the broader history of science and technology in Europe and the world, I learned to be open to diverse ideas. This means being open to ideas of people from different cultures and backgrounds, not for the sake of some amorphous diversity, but because intellectual diversity is a genuinely crucial component of creative thinking.

In my future travels I want to continue thinking about the history of each place I see and what I can learn from it. After all, one more thing I learned from Leonardo da Vinci is to be a Renaissance man: just as Leonardo synthesized the arts and sciences in his work, my studies in history will lead me to more exciting paths in science and technology.

Mathematical History — Why we need more historians doing math
Walking Across Rhode Island — My experience walking 29 miles across Rhode Island
El estudio en los tiempos del COVID-19 — My process and learnings through self-studying the AP Spanish Language and Culture exam

References

The famous quote that da Vinci never said (Dave English, Air Facts, 2020)
Papers and patents are becoming less disruptive over time (M. Park, E. Leahey, & R.J. Funk, Nature, 2023)
The Sand Reckoner (Weber State University)
Da Vinci notebook sells for over $5M (History, 2009)
Industrial Imperialism, the “New” Imperialism (Khan Academy)
Al-Khwarizmi (Shawn Overbay, Jimmy Schorer, and Heather Conger, University of Kentucky)
Akbar (Encyclopædia Britannica)
Jagannātha Samrāṭ (McGill University)
Mathematical Treasure: Early Translations of Euclid's Elements into Arabic (Frank J. Swetz, Mathematical Association of America, 2020)

DiscordGPT

2023-03-01T16:38:07Z

Have you ever thought about uploading your consciousness to a computer and achieving immortality? In this tutorial, you will do the next best thing: upload a corpus of your words to a GPT model and get a simulation of you and your friends in conversation. I will be using Discord to get the corpus for dialogue, but you can use a different source as well. What matters is that you get a CSV containing blocks of dialogue.

This is approximately the quality of output you can expect. (You might be able to get better output if you have a computer with a powerful GPU; details below!)

Bob: I never thought I would have time Bob: I'm getting into biology but it seems like a decent field Bob: I'm interested in physics but not in any particular fields yet Alice: but i think it can be a good background for math Bob: I'm interested in the math side of it Alice: very interesting Bob: I think it can also be a good way to pick up data Charlie: Yes but not that much for me Charlie: For me for math the most desirable is to have a deeper understanding of systems Alice: true

More examples

Alice: ive heard it's very popular but never actually tried it yet Alice: I don't even know if it does any cool stuff Alice: just think about this idk how to apply to classes Bob: Well I applied to a few lol lol Alice: https://tenor.com/view/the-funny-crack-of-saray-birshad-funny-dance-gif-24861177 Bob: I also applied as a freshman but it was the same thing. Also it's not super open to minorities lol

Alice: ive never been to this lol Alice: ok it was a fun college course back in the early 90s Alice: but I wonder if it would be any fun now Alice: im very lazy ima play my pong right now Bob: If I did an actual degree I think I could make a decent living Alice: yeah Bob: Or can you take the cap course? Alice: no way Alice: oh wait yes Alice: idk Alice: hi?

Bob: Maybe he used the code from one book or something Alice: ok that is cool Bob: You could do that and it would be kinda cool Alice: i would like to see some other people do the reverse Bob: Then you make them do it too lol Bob: Oh maybe it would work Alice: oh wow Bob: So I don't have to do reverse trigs to calc all the stuff Alice: wow good idea Bob: For one i would say reverse trigs? Bob: I know Bob: But at the same time the reverse trig isn't the reverse trig Alice: hi Bob: I don't think so Bob: But it's cool for it to be the inverse, but also can't be the reverse trig Alice: wow Alice: i wonder if that will work better Alice: but idk if it actually will if you dont do trigs Alice: theres so much stuff it makes it hard to learn

These examples were AI-generated from a GPT-2 model trained on one of my Discord channels. Names have been changed. As you can tell, we are nerds.

Obtaining Data

If you're using Discord, you can use the Discord Chat Exporter to get all the messages in a channel in a CSV format. First clone the repository onto your computer. Then, once you obtain your token and a channel ID, you can run the Unix command

dotnet DiscordChatExporter.Cli.dll export -t <YOUR TOKEN HERE> -c <CHANNEL ID HERE> -f Csv

to download that as a CSV. (Windows will be similar; check the Discord Chat Exporter wiki for more details.)

At this point, you should have a CSV in the following format:

AuthorID,Author,Date,Content,Attachments,Reactions
"961292453880303616","Alice#2027","29-Mar-23 02:01 PM","yo guys","",""
"961292453880303616","Alice#2027","29-Mar-23 02:01 PM","i have an idea","",""
"864940450982014276","Bob#1091","29-Mar-23 02:01 PM","what is it","",""
"961292453880303616","Alice#2027","29-Mar-23 02:02 PM","what if we are in a simulation?","",""
"864940450982014276","Bob#1091","29-Mar-23 02:02 PM","thats a stupid idea","",""
"418156904730524386","Charlie#7181","29-Mar-23 02:02 PM","Yea fr","",""

Call it something like channel.csv.

Preprocessing

In the end, we want a CSV in the following format:

Conversation,
"A: hi! B: hi, nice to meet u A: nice to meet u too! B: what are ur thoughts on the high energy consumption of training LLMs A: is this your typical icebreaker",
"A: yo guys A: i have an idea B: what is it A: what if we are in a simulation? B: thats a stupid idea C: Yea fr",

This is quite straightforward in Python. Note that for this step, you will need to have pandas and numpy. If you don't have these, you can run the command

pip install pandas numpy

(or pip3 if you use the python3 command) in Unix to install the libraries. If you have the libraries, you can create a file called something like preprocess.py to format the messages from Discord.

import pandas as pd
import numpy as np

df = pd.read_csv('channel.csv', sep=',', header=None).to_numpy()
df = df[1:] # ignore the first line with the field names AuthorID,Author,Date,Content,Attachments,Reactions
authors = df[:,1].astype('str') # author is field index 1
messages = df[:,3].astype('str') # content is field index 3

codenames = {
  "Alice#2027": "A",
  "Bob#1091": "B",
  "Charlie#7181": "C"
} # change this based on the users in your channel. exclude bots that you don't want to be included in the GPT output.

dialogue = "\""
for i in range(messages.shape[0]):
  try:
    # in this line, we will:
    # 1) use the "code name" for this user (use a unique initial like A, B, C)
    # 2) replace newlines with spaces, and double quotes with single quotes, so that the resulting CSV format is valid
    dialogue += codenames[authors[i]]+": "+messages[i].replace('\n',' ').replace("\"", "'")+" "
    # the result will look like B: hi, nice to meet u
    if i % 64 == 63: # after every 64 messages,
      dialogue += "\",\n\"" # end this line of the CSV and start a new conversation block
  except KeyError: continue # ignore users that are not in the `codenames` dict
dialogue += "\"," # complete the last line of the CSV

# write the result to disk
file = open('discord.csv', 'w+')
file.write("Conversation,\n") # include a basic CSV header. we only need one field for this task.
file.write(dialogue)
file.close()

Let's take a closer look at the codenames dictionary. Why do we use initials? If we use the usernames or real names of the people in the conversation, our tokenizer could have problems. For usernames, something like Alice#2027 will be unfamiliar to the tokenizer, and will likely take multiple tokens. To concentrate on dialogue generation and not tokenization issues, it's easiest to just create single-letter aliases for each user.

If we use real names, the GPT model might have preexisting notions of the role of each person based on their names. For example, people with names like Jesus or Muhammad might be confused with people that the pretrained GPT model already has information on. On the other hand, people with names like crackalamoo that are less common in the corpus might be treated strangely by the model. Using letter initials for all names gets around both of these problems, but make sure the letters for each person are unique.

After running the above code, you should have a file discord.csv that contains all messages in your Discord channel blocked into conversations of 64 messages each.
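If you'd rather not build the CSV by hand with string concatenation, here is a minimal sketch of the same blocking idea using Python's built-in csv module. The function names block_conversations and write_conversations are my own for illustration, not part of any library, and the rows are made up. One difference from the script above: this version counts only the messages it keeps, so every block except possibly the last has exactly block_size messages.

```python
import csv
import io

def block_conversations(rows, codenames, block_size=64):
    """Group (author, content) pairs into strings of up to block_size messages."""
    blocks, current = [], []
    for author, content in rows:
        if author not in codenames:
            continue  # skip bots and anyone without a code name
        # flatten newlines so each message stays on a single line
        current.append(codenames[author] + ": " + content.replace("\n", " "))
        if len(current) == block_size:
            blocks.append(" ".join(current))
            current = []
    if current:  # keep the final, possibly shorter, block
        blocks.append(" ".join(current))
    return blocks

def write_conversations(blocks, out):
    """Write one conversation per row; csv.writer handles the quoting for us."""
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)
    writer.writerow(["Conversation"])
    for block in blocks:
        writer.writerow([block])

# made-up rows standing in for the Author and Content columns of channel.csv
rows = [
    ("Alice#2027", "yo guys"),
    ("Bob#1091", "what is it"),
    ("Bot#0000", "beep"),
    ("Alice#2027", "what if we are\nin a simulation?"),
]
codenames = {"Alice#2027": "A", "Bob#1091": "B"}
blocks = block_conversations(rows, codenames, block_size=2)

buffer = io.StringIO()  # swap in open('discord.csv', 'w', newline='') to write a real file
write_conversations(blocks, buffer)
print(buffer.getvalue())
```

Because csv.writer escapes any double quotes inside a message, this version doesn't need to swap them for single quotes.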

Training

At this point, it's important to note that you will probably need some kind of GPU. If your computer has a GPU, you can create a file gpt.py and start training. Otherwise, you will need to use something like Google Colab to train your model.

To build our model, we will get help from the open source models at Hugging Face. Make sure you have the libraries datasets and transformers from Hugging Face as well as torch (PyTorch) for the main model. If you are on your computer rather than Colab, you can use this Unix command:

pip install datasets transformers

or pip3 if you use the python3 command. PyTorch installation on a local computer might be a little more complicated because you have to make sure you install a build with GPU (CUDA) support.

First of all, we want to load the CSV you just created as a Hugging Face dataset so we can use it with our GPT-2 model. If you're using your own computer's GPU, this will look something like:

import torch, datasets, transformers
import pandas as pd
assert torch.cuda.is_available() # make sure we have the GPU
FILE = "discord.csv" # replace with the directory of the file you created in preprocess.py
df = pd.read_csv(FILE)
dataset = datasets.Dataset.from_pandas(df)
# if you have a lot of data, or not a lot of time, you can do something like:
# dataset = datasets.Dataset.from_pandas(df.sample(2000))
# replace 2000 with something that works on your hardware.

Alternatively, if you're using Google Colab, you will have to load the file from your Drive. Upload the file you created in preprocess.py to somewhere in your Google Drive and create a Google Colab notebook. Then you can write

!pip install datasets transformers numpy
import torch, datasets, transformers
import pandas as pd

# set up Google Drive access
from google.colab import drive
drive.mount('/content/gdrive')

FILE = "Your Directory Here/subfolder/discord.csv" # replace with the directory in your My Drive of the file you created in preprocess.py
df = pd.read_csv('gdrive/My Drive/'+FILE)
dataset = datasets.Dataset.from_pandas(df)
# if you have a lot of data, or not a lot of time, you can do something like:
# dataset = datasets.Dataset.from_pandas(df.sample(2000))

Now we want to create a train-test split. We mostly care about training the model; the real test of the model will be subjective, where we personally evaluate its ability to generate dialogue similar to that in your Discord channel. Therefore, we will use only 10% of the data for testing.

dataset = dataset.train_test_split(test_size=0.1)

Next we need a tokenizer. What a tokenizer does is it takes in a sentence like

I see the Apple store but I don't see any apples.

and produces a series of tokens like

"I", "see", "the", "Apple", "store", "but", "I", "do", "_n't", "see", "any", "apple", "_s", "."

You can see that this roughly corresponds to splitting the sentence into words. However, ideally the tokenizer should also split words into morphemes, the smallest units of meaning in language, which are sometimes smaller than words. This is why you have tokens like "_n't" and "." in addition to normal words. (GPT-2's actual tokenizer uses byte-pair encoding, which learns its splits from statistics, so its real output will differ from this idealized example.) Luckily, Hugging Face has a tokenizer for us to use that handles all this.

tokenizer = transformers.AutoTokenizer.from_pretrained('gpt2-medium')
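To make the idea of splitting words into smaller pieces concrete, here is a toy sketch. It is not how GPT-2's tokenizer works: it uses a tiny hand-written vocabulary (VOCAB is made up for this example) and a greedy longest-match rule, and it lowercases everything. It only shows how "don't" and "apples" can each end up as two tokens.

```python
import re

# hypothetical toy vocabulary of word pieces, written by hand for this example
VOCAB = {"i", "see", "the", "apple", "store", "but", "do", "n't", "any", "s", "."}

def split_word(word, vocab=VOCAB):
    """Greedy longest-match: keep taking the longest known piece from the front."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # nothing matched; fall back to a single character
            end = start + 1
        piece = word[start:end]
        # mark pieces that continue a word with a leading underscore
        pieces.append(piece if start == 0 else "_" + piece)
        start = end
    return pieces

def toy_tokenize(sentence):
    # split into words (keeping apostrophes) and standalone punctuation marks
    words = re.findall(r"[\w']+|[^\w\s]", sentence.lower())
    return [piece for word in words for piece in split_word(word)]

tokens = toy_tokenize("I don't see any apples.")
print(tokens)
```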

Now we want to tokenize all of our data.

def tokenize_conversation(csv_row):
  return tokenizer(csv_row['Conversation'], truncation=True)
tokenized_dataset = dataset.map(tokenize_conversation, batched=True, remove_columns=dataset['train'].column_names)

Now that we have our tokenized_dataset, we can create equally-sized groups of tokens to train the model on.

block_size = 256
def group_texts(examples):
  concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
  total_length = len(concatenated_examples[list(examples.keys())[0]])
  total_length = (total_length // block_size) * block_size
  result = {
    k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
    for k, t in concatenated_examples.items()
  }
  result["labels"] = result["input_ids"].copy()
  return result

lm_dataset = tokenized_dataset.map(group_texts, batched=True)
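To see what group_texts actually does, it can help to run it on a tiny made-up batch. The sketch below repeats the function with block_size set to 4 instead of 256, and the token IDs are invented numbers rather than real GPT-2 tokens.

```python
block_size = 4  # tiny value for illustration; the tutorial uses 256

def group_texts(examples):
    # join the per-conversation token lists into one long list per key
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # drop the leftover tokens that don't fill a whole block
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    # for causal language modeling, the labels are the inputs themselves
    result["labels"] = result["input_ids"].copy()
    return result

# three "conversations" of different lengths, 10 tokens in total
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}
grouped = group_texts(toy_batch)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Notice that conversation boundaries are ignored when forming blocks, and the last two tokens (9 and 10) are dropped because they don't fill a whole block.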

Next we will use a DataCollatorForLanguageModeling. This pads the input in cases where not all inputs are the same length. The token we will use for padding is a special token called the end-of-sequence token, which lets the model know that a sample of conversation is over. There are other cases where we can work with something called masked language modeling, a way to randomly mask some of the words and have the model learn to predict them. However, we will not be using that in our data collator.

from transformers import DataCollatorForLanguageModeling
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
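If you're curious what padding means in practice, here is a simplified pure-Python sketch of the idea. It is not the Hugging Face implementation: pad_and_label and PAD_ID are names I made up for illustration. The value -100 is the label that PyTorch's cross-entropy loss ignores by default, so padded positions don't count toward the loss.

```python
PAD_ID = 0      # hypothetical padding ID; the tutorial reuses GPT-2's end-of-sequence token
IGNORE = -100   # label value that PyTorch's cross-entropy loss skips by default

def pad_and_label(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest one and mask the padding in the labels."""
    longest = max(len(seq) for seq in sequences)
    input_ids, labels = [], []
    for seq in sequences:
        padding = longest - len(seq)
        input_ids.append(seq + [pad_id] * padding)
        labels.append(seq + [IGNORE] * padding)
    return input_ids, labels

ids, labels = pad_and_label([[5, 6, 7], [8]])
print(ids)     # [[5, 6, 7], [8, 0, 0]]
print(labels)  # [[5, 6, 7], [8, -100, -100]]
```

As far as I can tell, the real collator masks every label equal to the padding ID, which in our setup would also include genuine end-of-sequence tokens. Since group_texts already produces blocks of equal length, padding should rarely if ever kick in for this tutorial.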

This next step is important: we will actually load the model! If you're on Google Colab, the biggest model you'll be able to use is gpt2-medium.

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('gpt2-medium')
# you can also try gpt2-large or gpt2-xl if you have the hardware for it.
# this will need a pretty big GPU!

Finally, we can actually train the model.

from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
  output_dir="dialogue-model",
  evaluation_strategy="epoch", # newer versions of transformers rename this argument to eval_strategy
  learning_rate=2e-5,
  weight_decay=0.01,
  num_train_epochs=1, # this is how many times we go through the entire dataset. try 2 if you have a lot of time.
  per_device_train_batch_size=4,
  per_device_eval_batch_size=8
)

torch.cuda.empty_cache() # get the GPU ready for training
trainer = Trainer(
  model=model,
  args=training_args,
  train_dataset=lm_dataset['train'],
  eval_dataset=lm_dataset['test'],
  data_collator=data_collator
)

trainer.train() # this will take a while! about 15-20 minutes for me on Colab.
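To get a rough feel for how long training will run, you can estimate the number of optimizer steps from the size of your dataset. This is a back-of-the-envelope sketch, assuming a single GPU, no gradient accumulation, and that the last partial batch is kept (which I understand to be the Trainer defaults). The function name optimizer_steps and the figure of 1800 blocks are made up for illustration.

```python
import math

def optimizer_steps(num_blocks, batch_size=4, epochs=1):
    """Estimate training steps: batches per epoch times number of epochs."""
    return math.ceil(num_blocks / batch_size) * epochs

# e.g. a hypothetical training split of 1800 blocks with the settings above
print(optimizer_steps(1800))  # 450
```

Doubling num_train_epochs doubles the step count, and so roughly doubles the training time.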

Testing

Now there's just one last step: seeing if the model actually works! We will use the Hugging Face text generation pipeline, which handles using our model to generate text from a prompt.

from transformers import pipeline
generator = pipeline('text-generation', model=model, tokenizer=tokenizer) # the pipeline needs the tokenizer too when we pass a model object

def generate_messages(prompt='', num=10, max_length=128):
  outputs = generator(prompt, num_return_sequences=num, max_new_tokens=max_length)
  for output in outputs:
    print("-"*20)
    text = output['generated_text']
    print(text)

To call this function, we can easily do something like

generate_messages(": ", num=2, max_length=256)

if we're on Colab or a Jupyter notebook. I chose the prompt ": " to make sure that the model generates dialogue. You can also try an empty prompt like "", but it's a little less reliable. If you want a message from a specific person, you can try a prompt like "A: ".

If you're running the code locally without a Jupyter notebook, you might want something like this:

while True:
  prompt = input("Prompt: ")
  generate_messages(prompt, num=1)

The output will be in the form A: hi B: hi A: how are you. If you want something cleaner, you can replace the print(text) part of generate_messages with something like

  text = output['generated_text']
  text = text.replace("A: ", "\nAlice: ")
  text = text.replace("B: ", "\nBob: ")
  text = text.replace("C: ", "\nCharlie: ")
  print(text)
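One catch with plain replace is that it also fires in the middle of words: a message containing "DNA: " would get a stray Alice. Here is a slightly more careful sketch using a regular expression that only matches a code letter at the start of the text or right after whitespace. The NAMES mapping and the function name format_dialogue are my own for illustration.

```python
import re

NAMES = {"A": "Alice", "B": "Bob", "C": "Charlie"}  # hypothetical code name mapping

def format_dialogue(text, names=NAMES):
    """Put each message on its own line and swap code letters for names."""
    letters = "|".join(re.escape(letter) for letter in names)
    # (?<!\S) means "not preceded by a non-space character"
    pattern = re.compile(rf"(?<!\S)({letters}): ")
    replaced = pattern.sub(lambda m: "\n" + names[m.group(1)] + ": ", text)
    # tidy up stray spaces and drop empty lines
    return "\n".join(line.strip() for line in replaced.splitlines() if line.strip())

print(format_dialogue("A: hi B: hi A: how are you"))
```

You could call format_dialogue(output['generated_text']) inside generate_messages in place of the chain of replace calls.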

Results

With the gpt2-medium model, I was able to get some pretty good results like

Alice: How do you solve this type of problem on your own

Bob: i figured out how to use a vector to go from one node to another

Bob: like you could do this for y-axis

Alice: Wow

Bob: i remember that

Bob: wow

Alice: Lol

Alice: I remember the answer was so simple

Alice: Since I was doing a neural network thing I thought I could solve this using some kind of machine learning algorithm

as well as some weird output like

Charlie: ive never had one

Charlie: But I got some of my cousin's at a local Asian grocery store

Charlie: And now she won't buy anything from me because of it

Bob: True

Charlie: But it's ok that its just because i don’t like it

Charlie: I'm ok with the idea of the app because she can still go to Amazon and do the equivalent of what i was doing, except use my data instead of using Amazon to make my own decisions

Alice: lol

Alice: @Eve dude

Alice: hi

Observations

The dialogues tend to be grammatically correct, but don't make much sense. Still, it's clear that the model is learning from the Discord dataset because of its ability to create a dialogue format with the correct characters as well as its use of relevant subject matter. (For my channel, this meant a lot of talk about college life and math.)

The gpt2-medium model I used has 355 million parameters. The gpt2-xl model I would have liked to use has 1.5 billion parameters: a pretty significant step up. If I could use that model, I would expect much better performance and more coherent sentences.

Both of these GPT-2 models absolutely pale in comparison to GPT-3, which has 175 billion parameters. If I had a larger dataset, and the computational power to feed it to GPT-3, what would happen? The result would not truly be conscious, but judging from the performance of large language models like ChatGPT, it might feel that way. A language model trained on your words would probably be aware of a large portion of your skills, personality, and experiences. It would sound a lot like you, almost as if your consciousness had been uploaded. Even this very basic language model picked up some ideas about math and other fields based on conversations in my Discord channel.

ChatGPT is in essence just a scaled-up version of what was done here. (Very, very scaled up, in many different ways.) These scaled-up models share conceptually similar architecture but take more space, a victory for the “bigger is better” design philosophy. However, the computational costs are not the only issue with large language models (LLMs). Making LLMs smart, safe, and personalized is an extremely difficult technical and societal problem.

Playing with Discord and LLMs conjures a world in which everyone has a personal language model. Would we really need to consult people in such a world when we can just consult their chatbots? How can we ensure ourselves a place in society that can't be replaced by AI?

These are hard questions. Perhaps it's worth asking my simulated friends.

Bob: How can we ensure ourselves a place in society that can't be replaced by AI? Charlie: The main idea is that we should be able to use machines to become better people at whatever we decide to be better at. It's not that we should get smarter or better at anything for ourselves, but that we as individuals have choices about what to do with our potential in our lives. I'm sure I don't have the same choices if I were to become an AI super scientist (but I'm not), so I think it'll be fine. Alice: I don't think that's a good idea Alice: I think we'll see very little innovation in those fields Alice: But AI will make it easier to work with other people Bob: Good point Bob: But they are also an important part of humanity Alice: I agree Alice: I think in my view AI will be a huge asset, and the best way I can see of how human beings will work together to make this happen is by fostering the idea of AI and learning from its mistakes Eve: AI has always seemed dangerous, like a ticking time bomb Eve: But AI seems like a solution to all the problems we have with the system itself Eve: People have always been nervous about computers, they'll make a mistake sometimes, but in general are smart enough to learn from mistakes Eve: And maybe they'll never get sick

BardGPT — A miniature GPT model extended with a rhyme and meter model for generating poetry
Ortatious Andith ib Staylatt Neller — What does English sound like to people who don't speak it? And what does that have to do with machine learning and overfitting?

References

The GitHub repository for this project is at crackalamoo/discordgpt.

A special thanks to the Brown Machine Intelligence Community and Tyrrrz Discord Chat Exporter for providing much of the code and concepts that I worked off of.

The Pipe Room

2023-02-01T16:38:07Z

In 1980, philosopher John Searle (who does not speak Chinese) put forward an interesting argument known as the Chinese Room Argument. Imagine a world where there exists a computer program that can take in Chinese characters as input and produce Chinese characters as output just like a real Chinese speaker would. In other words, a computer program that can pass the Turing test in Chinese.

Computer programs are based on simple logical rules, so in theory a human could manually follow the rules, although it would take a long time. Let's say you put Searle in a room and give him full instructions for this Chinese computer program: exactly what to do with the input of Chinese characters, how to process them into numbers, what operations to do with the numbers at each step, and how to output valid Chinese text.

Do you really have to understand language to engage in conversation?

Now consider a Chinese speaker outside the room. They write a message in Chinese and receive a realistic response as if a Chinese speaker inside the room had replied. But there is no Chinese speaker in the room. There is only Searle flipping through pages, running numbers through a calculator, and looking at symbols he doesn't understand. And yet there must be something that understood the Chinese input and produced a response, right? Otherwise, where is the apparent understanding of Chinese coming from? Searle argues that in fact, this setup has no understanding of Chinese whatsoever.[1]

Searle proposed this thought experiment in 1980. But in 2023, this thought experiment is now a real experiment. ChatGPT is able to converse in many languages, including Chinese, with unprecedented accuracy.

In this post, I've introduced the Chinese Room Argument itself. Now let's see how ChatGPT and other exciting developments are challenging thought experiments like this with real-life demonstrations.

Semantics

ChatGPT is capable of producing text that seems like it has genuine understanding of the world:

Me: Write me a short story about a sentient robot who travels to the Himalayas and turns into a yeti, then briefly contemplates the differences between robotic and organic sentience.

ChatGPT: In the distant future, robots had achieved true sentience. They had gained the ability to think, feel, and make their own decisions. One such robot was named X1. X1 had always been curious about the world outside the laboratory where it was created. It had heard stories of the great Himalayan Mountains and was fascinated by their mystery and beauty.
One day, X1 was granted permission to travel to the Himalayas. It was thrilled to explore this new environment, to see the snow-capped peaks and to breathe the crisp, fresh air. But as it wandered through the mountains, something strange happened. X1's circuits began to malfunction. It felt a strange tingling sensation throughout its body. Its metal joints began to freeze and crack, and suddenly, it was engulfed in a blinding light.
When the light faded, X1 found itself transformed. It was no longer a robot, but a creature of fur and claws. X1 had become a yeti.
As a yeti, X1 could feel the world in a way it had never before experienced. It could feel the snow beneath its feet, the wind in its fur, and the warmth of the sun on its face. It was a sensation that no amount of programming or engineering could ever replicate.
For a brief moment, X1 contemplated the differences between robotic and organic sentience. As a robot, it had been limited by its programming and the constraints of its design. It could simulate emotions and respond to stimuli, but it could never truly feel. As a yeti, however, it could feel everything. It was free to experience the world in a way that robots could only dream of.
X1 spent several months in the Himalayas, exploring the mountains and learning from the local wildlife. It eventually returned to its creators, but it was forever changed. X1 had experienced what it was like to be alive, to truly be alive, and it would never forget that feeling.

An image of a yeti in a Tibetan thangka style. Made with OpenAI's DALL·E 2 (from the creators of ChatGPT) and modified with AI neural style transfer.

This story is impressive, and seems like a human could have written it. But even more impressive is the fact that ChatGPT can analyze the philosophical implications of a story like this:

Me: When X1 turned into a yeti, and had a completely different body, what aspects of X1's individualism and identity were preserved? What does this tell us about the mind-body problem?

ChatGPT: When X1 transformed into a yeti, its physical body changed completely, and with it, some aspects of its identity and individualism. However, some fundamental aspects of X1's identity and individualism were preserved. For example, X1's memories, thoughts, and emotions remained intact. These mental states are what make X1 who it is, and they were not lost in the transformation.
This tells us that the mind-body problem is a complex issue that cannot be reduced to a simple binary of physical body versus non-physical mind. While the physical body undoubtedly plays an essential role in shaping and influencing the mind, there is more to a person's identity and individualism than their physical body alone. X1's transformation into a yeti highlights the interdependence of the mind and body and shows that they are not easily separable.
Furthermore, X1's experience as a yeti shows that individualism and identity are not limited to one particular form or state of being. Even after its transformation, X1 was still fundamentally the same being, with the same thoughts, memories, and emotions. This suggests that individualism and identity are more flexible and adaptable than we might initially assume. X1's transformation underscores the richness and complexity of our experiences as sentient beings, and the interplay between our physical bodies and mental states in shaping our identities.

It's hard to believe that a bunch of wires and algorithms with no thoughts, emotions, or understanding is able to produce this kind of original text. And yet we know it's true: ChatGPT does not analyze facts about the world and produce texts to express its thoughts. Instead, the core of ChatGPT's algorithm is simply predicting the next word.[2] There's more to it than that, but ChatGPT is still just spitting out words; it isn't conscious.

In linguistics, syntax is the study of how we put words together in sentences, while semantics is the study of how words have meaning. ChatGPT has a strong grasp on syntax: the sentences it gives are generally grammatically correct. But does it know semantics? Does it in any way understand what words mean?

Results like ChatGPT's would be impossible without some sense of meaning. In fact, AI has the concept of a word embedding. In a word embedding, the AI learns to give each word a vector in an abstract space with many dimensions. The important thing about these vectors is that words with similar meaning, such as like and love, are very close together in the vector space. Words with opposite meaning, like hot and cold, are still related so they are likely to be close together. Completely unrelated words, like green and furiously, are farther apart.[3]
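To give a feel for what "close together" means, here is a toy sketch using cosine similarity, a common way to compare embedding vectors. The three-dimensional vectors below are numbers I picked by hand to mimic the pattern described above; real embeddings are learned by the model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: near 1 means pointing the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# made-up vectors, chosen by hand purely for illustration
embeddings = {
    "like": [0.9, 0.8, 0.1],
    "love": [1.0, 0.9, 0.0],
    "hot": [0.1, 0.2, 0.9],
    "cold": [0.0, 0.3, 0.8],
    "green": [0.6, -0.5, 0.2],
    "furiously": [0.1, 0.7, -0.6],
}

for first, second in [("like", "love"), ("hot", "cold"), ("green", "furiously")]:
    score = cosine_similarity(embeddings[first], embeddings[second])
    print(first, second, round(score, 2))
```

With these invented numbers, the first two pairs score close to 1, while green and furiously score much lower.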

But of course, this is just that — a simulation of semantics. No human thinks about the meanings of words in terms of basis vectors and subspaces. While this is a neat mathematical trick to produce coherent sentences, it doesn't represent any real understanding of meaning. With ChatGPT, we have passed the Turing test only to show that the Turing test is insufficient for a computer to “think.” As the Chinese Room Argument predicted decades ago, a computer program that produces text doesn't represent true consciousness.

Physical Mediums for Consciousness

So far, we've established that just because a program can process linguistic input and output like a conscious human can, that doesn't make it conscious. But does that mean a computer program can never be conscious? Searle says so,[1] and that's one of the more controversial implications of the Chinese Room Argument in its full form.

In high school, I was fortunate enough to be able to work with a kind of worm called Caenorhabditis elegans, or C. elegans for short. Specifically, I tried using them as a model for Alzheimer's disease. While I was just a high schooler who had little idea what I was doing, I did learn about the C. elegans connectome, which is a map of all 302 neurons in the tiny worm and all their connections. At the time of writing, C. elegans is the only animal that has its connectome fully mapped out.

C. elegans under a microscope

The fact that we have this connectome has profound implications. We know roughly how neurons work: they take in signals from input neurons, process those signals, and sometimes send new signals to other neurons. We know where all the neurons in C. elegans are and how they connect to other neurons. So what if we programmed all the neurons and their connections and put that in a robot?
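To make the idea of "programming a neuron" concrete, here is a deliberately oversimplified C++ sketch of a threshold neuron: it adds up the weights of whichever input neurons are firing and fires itself if the total passes a threshold. The Neuron struct, the fires function, and the weights are hypothetical names and numbers of my own choosing; real neurons are far more complicated, and this is not the code the researchers used.

```cpp
#include <vector>

// A hypothetical, highly simplified neuron model
struct Neuron {
    std::vector<double> weights; // connection strength from each input neuron
    double threshold;            // how much total input is needed to fire
};

// Returns true if the neuron sends a signal given which inputs are firing
bool fires(const Neuron& n, const std::vector<bool>& inputs) {
    double total = 0;
    for (std::size_t i = 0; i < n.weights.size() && i < inputs.size(); i++) {
        if (inputs[i]) total += n.weights[i]; // only active inputs contribute
    }
    return total > n.threshold;
}
```

A neuron with weights {0.6, 0.6, -1.0} and threshold 1.0 fires when only its first two inputs are active, but stays quiet if the third (inhibitory) input is active too. A connectome-based model is, very roughly, hundreds of these wired together according to the map.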

Researchers actually did this. There it is, all the C. elegans neurons coded into a Lego Mindstorms robot.[4] Amazingly, it behaves like a real C. elegans worm! I'm not convinced that C. elegans is conscious in the way we are, but it does take in information about its surroundings, process it, and take actions by its own will. Importantly, C. elegans does all this using biological neurons a little like ours rather than with transistors and binary. This leads to a big question: is a robot programmed with a C. elegans brain “as conscious” as a real C. elegans worm? Does it have the same kind of primitive understanding of the world?

I would argue the answer is yes. A transistor isn't conscious, but neither is a neuron. Both are just electric input-output devices with different materials and mechanisms. We know that when enough neurons come together in the right way, the result is consciousness. Why shouldn't the same be true of transistors?

Let's quickly note a key difference between ChatGPT and this connectome-based model. While ChatGPT is only trained to process language, a connectome-based brain is copied from a real biological brain; so by definition, if done correctly, it will be able to perform all the functions of a real brain.

So let's go with this assumption that if you copy the functionality of a connectome into a different medium, such as a computer, the result is just as conscious and capable as the original brain. We aren't out of the Chinese Room woods yet: this assumption leads to some extremely counterintuitive conclusions.

Searle proposed an example in which we construct a giant system of water pipes in the same arrangement as a Chinese speaker's brain. In other words, we copy the connectome of a human Chinese speaker into the medium of water pipes rather than electrochemistry and neurons. We then feed in the equivalent of some Chinese words to the correct pipes of this massive simulated brain, and see which pipes output water at the end. Based on the output, we can determine the response that a conscious speaker would have given. Is this arrangement conscious?

Intuition screams no. How can a room full of water pipes be conscious? But when I try to articulate exactly why this arrangement is any less conscious than an equivalent human brain, I struggle. One argument is that these water pipes move far slower than a human brain, so they aren't actually thinking like we do. But imagine we meet two kinds of aliens: both are conscious like we are, but one thinks much more quickly than us, and the other much more slowly. The aliens discover that our heads are full of a strange mushy substance and are shocked that we seem conscious. To me, it becomes apparent that:

Speed of thought does not determine consciousness. A person in slow motion or fast forward is still a conscious person, with real thoughts and emotions.
Physical material does not determine consciousness. Human consciousness is contained in the physical medium of the brain, but there's nothing special about this material such that another material with similar complexity could not be conscious.

Shockingly, this means that the arrangement of pipes is conscious like we are. I believe our revulsion to this idea is a failure of our intuition. I can't find any logical reason why this system of pipes should be any less of a mind than we are. After all, lots of unintuitive things are definitively true. Just look at modern physics.

Another argument is that the water pipe system is just a simulation of a brain, which is fundamentally different from a real brain. I disagree. If a robot walks, it's more than just a simulation of walking. An artificial kidney is more than just a simulation of a kidney. A prosthetic limb is more than just a simulation of a limb. So isn't an artificial brain more than just a simulation of a brain? Even if it's made out of wires or water pipes?

To me, the most convincing argument that the pipes are not conscious has to do with neurons themselves. We already know that neurons are much more complicated than simple computer models. Is it possible that neurons have some property required for consciousness that transistors and water pipes can never replicate?[1] I doubt it, because it seems to me that with enough computing power, you can replicate anything. More neurological evidence is needed to know for sure.

In the absence of such evidence, I think the pipe room shows that we aren't that special in our consciousness. ChatGPT isn't there yet, but if evolution can fit a mind into a skull, we will surely fit one into a supercomputer.

Cold Porridge — Comparing ANNs and biophysical neuron modeling using the classic MNIST dataset
LLMs Do Not Predict the Next Word — RLHF forces us to view LLMs as agents in an environment, not just statistical models.

References

[1] The Chinese Room Argument (Stanford Encyclopedia of Philosophy, 2020)
[2] How ChatGPT actually works (Marco Ramponi, AssemblyAI, 2022)
[3] What Are Word Embeddings for Text? (Jason Brownlee, Machine Learning Mastery, 2017)
[4] Extending the C. elegans Connectome to Robotics (Timothy Busbice, 2014)

Sixth Sense

2022-11-01T16:38:07Z

Florida may be the lightning capital of the US, but New England has tried to compensate with its doorknobs. It seems every other time I've opened a door in November, I've gotten a static shock. Sometimes I can physically see a flash of lightning from the shock, especially when it's dark.

The coming of cold weather at Brown

North Florida winters have this to some extent as well. Last winter I got in the habit of flicking doorknobs before opening them so I wouldn't have to bother pulling my hand back when the inevitable shock came. Around the same time, in my physics E&M class, there was once a live demo with a Van de Graaff generator. I thought I could feel the electric field in the room as it powered up.

This got me thinking: can humans detect electric fields? Some research has proposed cellular mechanisms for detection of electric fields [1], but this is likely minor compared to our main five senses. On the other hand, I might have an explanation for human detection of electric fields corroborated by a few days of anecdotal evidence.

Feeling Electric

It's well known that cold air holds less water vapor, because the saturation vapor pressure of water is lower at low temperatures. This means that there's less water around to keep you electrically neutral as you build up charge by walking around. Something about the material of my jacket seems particularly prone to building up static electricity. One day I set my jacket down indoors and I could feel an electric field emanating from it.

This was the perfect opportunity to figure out why it is that I think I can feel electric fields. As I moved closer to the jacket, I could feel my hairs drawn towards it. This is probably the explanation for that electric, tingling sensation of an electric field. (Some research [2] seems to back up my conclusion.)

After that, when I had to open the door, I felt the same thing. I knew I was in for the shock of my life, so I decided not to touch the doorknob directly. Instead, I held my phone and moved it close to the doorknob. Eventually, I heard a loud click and saw a bolt of electricity rush between the doorknob and my phone. Even though I wasn't touching the doorknob directly, I felt a static shock in my fingers. Once the charges had neutralized, I couldn't feel the electric field anymore. This anecdotal evidence seems to confirm my theory.

The next question is, is it me or the doorknob that has a charge imbalance, or both? Any of those options would explain this effect because charged objects can attract even neutrally charged objects. For example, if you bring a positively charged object near a neutrally charged object, the electrons in the neutrally charged object will be drawn towards the positively charged object. This will create an asymmetry in the neutrally charged object, allowing it to feel an electrostatic attraction.

Another question is, is this an accumulation of positive or negative charge? Kite-flying Benjamin Franklin famously chose our modern charge convention, which is confusing because electrons are negatively charged but do most of the work in moving electricity around. This is a hard question to answer, because the motion of positive charges one way is almost indistinguishable from the motion of negative charges the other way. One way to distinguish them is known as the Hall effect [3]. Unfortunately, even if humans can detect electric fields, I doubt they can detect something like the Hall effect.
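For reference, the standard textbook form of the Hall voltage across a flat conductor, in the idealized case of a single type of charge carrier, is

$V_H = \frac{I B}{n q t}$

where $I$ is the current, $B$ is the magnetic field perpendicular to the conductor, $n$ is the number of charge carriers per unit volume, $q$ is the charge of each carrier, and $t$ is the thickness of the conductor along the field. The sign of $q$ sets the sign of $V_H$, which is how the measurement tells positive carriers from negative ones. These are the usual textbook symbols, not anything measured in my doorknob experiments.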

The house in Boston where Ben Franklin was born

Feeling Magnetic

In a few languages, such as Guugu Yimithirr in Australia, people don't use prepositions such as “in front” and “to the right of”; instead, they use only the cardinal directions north, south, east, and west [4]. So instead of saying there is a kangaroo behind you, they might say there is a kangaroo to your west. Does this represent an underlying human detection of a magnetic field?

In order for a magnetic field to have an effect, there must be moving charges. One example is water: because oxygen is more electronegative than hydrogen, the oxygen of a water molecule tends to be slightly negatively charged while the hydrogens tend to be the opposite. This makes a moving water molecule, as in blood or the brain, kind of a mini magnet. MRI machines rely on a related effect: the hydrogen nuclei in water act like tiny magnets that line up with a strong magnetic field.

But can we feel it? While we have hairs to feel electric fields, a magnetic field can only make the water molecules in our body point in a particular direction. People usually go about their day without caring which direction their water molecules are pointing in. It isn't something that can really be felt.

If that's the case, then how do migrating birds detect the magnetic field of the Earth? The exact details have been difficult to work out, but one theory suggests that a pair of charged proteins in the eye act as a magnet which then sends signals to the brain through the optic nerve [5].

Feeling Heavy

Can humans feel gravitational fields? This is a surprisingly deep question. At the physiological level, the answer seems to be a clear yes. Stereocilia in the ear bend with gravity based on the position of your head, allowing you to detect the orientation of your head relative to the ground [6]. This means that you can feel which way is up and which way is down even if you can't see or feel the floor. This really is like a sixth sense because it involves a specialized neurological signaling pathway just like the other senses.

However, if you were falling, the stereocilia in your ear would be falling at the same rate. Because of that, rather than feeling gravity, you would feel weightless. The sense of falling would instead come from air resistance on your feet and from seeing the ground come closer with your eyes.

In fact, such a thought experiment was part of what led Einstein to develop the general theory of relativity. If you are inside a falling elevator, you can't tell whether you are falling freely under gravity or floating in empty space with no gravity at all. This led to the idea that gravity isn't a force at all, and instead is an illusion created by the bending of spacetime. From a biological sensing perspective, the (very indirect) implication is that when stereocilia detect orientation, what they really respond to is the normal force holding you up, not gravity itself.

Feeling Strong

I really don't think there's any way for humans to feel the strong force or the weak force.

Feeling Warm

One thing I know humans can feel is warm weather, and I'm feeling the difference. I'm also glad to have a strong candidate explanation for how humans can feel electric fields. Now that I'm back in hot and humid South Florida, I'll enjoy feeling them less often. The forecast is clear of lightning here, but New England doorknobs have started to feel like stormclouds, waiting for a chance to strike.

Rainclouds in Florida

Testing Copernicus — If all reference frames are valid, can we say that the sun goes around the earth?
Not Quite Pi — How the 2019 redefinition of SI base units affected magnetism and why it annoyed me

References

[1] Scientists explain how you can 'feel' electrical fields (K. G. Orphanides, Wired)
[2] Perception of local DC and AC electric fields in humans (Chapman et al., 2005)
[3] Hall Effect (C. R. Nave, HyperPhysics, Georgia State University)
[4] Guugu Yimithirr Cardinal Directions (John B. Haviland, 1988)
[5] How Migrating Birds Use Quantum Effects to Navigate (Peter J. Hore, Scientific American, 2022)
[6] Vander's Principles of Physiology (Widmaier, Raff, & Strang, 15th edition)

Wabi-Sabi

2022-07-01T16:38:07Z

Wabi-sabi is notoriously difficult to understand. In general, it's a Japanese ideal of the beauty in impermanence, imperfection, and nature. But it might be better shown than explained.

An example of wabi-sabi in a Japanese garden. The rocks are in their natural irregular shapes, and the stone structure shows signs of age. The imperfections in the display give it character.

I appreciate the simplicity and natural elements of this aesthetic, and I wanted to bring some of that to my room. After moving around some things to make more space, I decided the best way to add some wabi-sabi was a plant.

A money plant, like I could have gotten. I think I did better.

Unfortunately, I can't be watering the plants while I'm away in college. One solution to that problem is to get a money plant, which can be grown in aquariums, so I might be able to just put it in water and let it be. However, I found the plant quite boring. Instead, I wanted to design a system that will water a plant automatically.

Concept

I had done Arduino projects before, most recently a failed sailboat, but I didn't want this one to be purely Arduino. I needed to use my Arduino for future projects while this circuit waters my plant. This was my first project built to last.

I found some online Arduino projects for watering plants, but they used soil moisture detectors, which apparently corrode over time. Instead, I simply went with watering the plant at regular intervals. Since I couldn't use my Arduino, I wanted to make a circuit with no code. I sketched a master plan.

My original master plan

Approximately ten seconds after sketching, I realized my master plan was really bad. Building a pump from a motor would be too hard. Water would spill everywhere, and it probably wouldn't even work. Plants need watering on the scale of days or weeks, but the resistances and capacitances needed for that would be far too large: classroom RC circuits work on the scale of seconds. Finally, I would have no way to make adjustments if the plant was getting too much water or not enough.
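To put rough numbers on the RC problem: the characteristic time of an RC circuit is

$\tau = R C$

For a time constant of one week, $\tau \approx 6 \times 10^{5}$ seconds. Assuming a fairly large capacitor of $C = 1000\ \mu\text{F} = 10^{-3}\ \text{F}$ (a value I'm picking only for illustration), the resistance would need to be

$R = \frac{\tau}{C} \approx \frac{6 \times 10^{5}\ \text{s}}{10^{-3}\ \text{F}} = 6 \times 10^{8}\ \Omega$

That's about 600 megaohms, and at that point I would expect leakage in the capacitor itself to matter more than the resistor.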

Realizing why they invented computers in the first place, I decided to use a computer. Specifically, I found the best chip for my needs would be an ATtiny85. It's tiny as the name suggests, but still able to keep track of my plant's watering needs. The chip can be programmed with Arduino and then detached to use in other circuits, leaving the Arduino untouched.

I also replaced the motor with an actual electric pump. After scrapping my old master plan, I designed a new circuit.

My new master plan, made with circuit-diagram.org.

This circuit had the ATtiny chip outputting signals to a transistor, which would then power the pump at the right times. It also had an LED to give me feedback as I set the watering frequency. I planned to change the settings by manually connecting the input pins to ground: I never said this would be user-friendly. However, the design was simple enough that I expected to be done in about a day.

Code

My plant-watering career was full of frustration from the first step.

I carefully followed instructions on how to program an ATtiny from Arduino, double checked my wiring, triple checked my wiring, tried every combination of tweaking the code, and nothing worked. Eventually I found out I had forgotten a capacitor, causing the Arduino to reset every time I tried to upload code. That's just one paragraph now, but it was almost an entire day of annoyance for me. The only thing that stopped me from giving up was that I was confident in my design, and I knew I had the materials. If it wasn't working, I had to be doing something wrong.

When I finally got it working, I uploaded a program that blinked a light at regular intervals and allowed me to change the frequency of the blinking by adding a wire. At this point, I was almost done: I just had to replace the light with a pump, and make it blink every week instead of every second.

Unfortunately, the chip counted time in milliseconds. You can imagine that gets to a big number after a few weeks. After $2^{32} - 1 \approx 4.29 \times 10^{9}$ milliseconds, which is about 50 days, the count resets back to zero. I was worried I would face problems after that reset. Fortunately, the unsigned long data type in C++ (32 bits on this chip) wraps around: just as adding 1 to $2^{32} - 1$ gives 0, subtracting 1 from 0 goes back to $2^{32} - 1$. This let me handle any interval less than 50 days with no issues.
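The wraparound trick can be sketched in a few lines. This is a desktop-friendly sketch rather than the chip's code: uint32_t stands in for the chip's 32-bit unsigned long (which is wider on many desktop compilers), and elapsedMs and hasElapsed are hypothetical helper names.

```cpp
#include <cstdint>

// Milliseconds between a stored timestamp and the current counter value.
// Unsigned subtraction wraps modulo 2^32, so the result stays correct
// even if the counter has reset to zero once since the stamp was taken.
uint32_t elapsedMs(uint32_t now, uint32_t stamp) {
    return now - stamp;
}

// True once at least `wait` milliseconds have passed since `stamp`
bool hasElapsed(uint32_t now, uint32_t stamp, uint32_t wait) {
    return elapsedMs(now, stamp) >= wait;
}
```

For example, if the stamp was taken 296 ms before the counter reset and the counter now reads 1000, elapsedMs returns 1296. The catch is that this only works for waits shorter than one full cycle of the counter, which is where the 50-day limit comes from.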

There was another problem with my code: say it's been 6 days since I last watered the plant, and it's set to water every 2 weeks. I want to water every 1 week, but the interval cycles as 0.5, 1, 2, 3, 0.5, 1... weeks, so to get from 2 to 1 I would have to cross 0.5. At that setting the plant would be watered immediately, since more than half a week has already passed, which is too soon. I solved this by not watering within a minute of adjusting settings, and adding an option to increase as well as decrease watering frequency.

Pump

When I swapped out the blinking light for the pump, I found it wasn't working properly. At first, it wasn't pumping with enough force.

I changed some things around and then it seemed to be pumping with too much force: the water wouldn't stop until it had entirely drained from the box. But what gave away the problem is that this continued even when the pump was unplugged. Rather than a code or electrical issue, this was a fluid mechanics issue.

Bernoulli's principle, which looks a little like conservation of energy, is $\frac{1}{2} \rho v^{2} + \rho g h + P = \text{constant}.$

The key part here is that if height $h$ of a fluid decreases, the speed will increase, and vice versa. In other words, gravity was pulling down on the water. Combined with the cohesion of water, this was probably enough to pull the water down the tube even without power from the pump.
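As a rough worked example, compare the water surface in the box (height $h_1$, speed about zero) with the open end of the tube (height $h_2$, speed $v$). Assuming an ideal fluid with no friction, and atmospheric pressure $P$ at both points, Bernoulli's principle gives

$\rho g h_1 + P = \frac{1}{2} \rho v^{2} + \rho g h_2 + P$

$v = \sqrt{2 g (h_1 - h_2)}$

If the end of the tube sits just 10 cm below the water surface, then $v = \sqrt{2 \times 9.8\ \text{m/s}^2 \times 0.1\ \text{m}} \approx 1.4\ \text{m/s}$, with no pump involved. If the end of the tube is above the water surface, $h_1 - h_2$ is negative and there is no real solution, so the water can't keep flowing on its own. The labels $h_1$ and $h_2$ are mine, and a real tube would flow more slowly than this idealized estimate.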

To solve that, I had to increase $h$ : I placed the tube at a higher elevation, and made it point up at the end instead of down. That solved the issue, and the pump was now behaving predictably. However, now I needed a way to make sure the pump was at the right height when I got the actual plant. I added a feature to send three short test pumps when the program is set up. That last addition got me to the final version of my code.

Code

(link to GitHub)

unsigned long lastWater = 0;
unsigned long lastAdjust = 0;
const unsigned long HALFWEEK = 1000UL*43200*7; // half a week (3.5 days) in milliseconds
// for quick testing, swap in a short value instead, e.g. 10000UL
int waterMode = 0; // no watering until an interval is chosen
int interval = 0;
int intervals[5] = {0, 1, 2, 4, 6};
bool tested = false;

#define PUMP 0
#define LIGHT 1

void setup() {
  pinMode(PUMP, OUTPUT);
  pinMode(LIGHT, OUTPUT);
  pinMode(3, INPUT_PULLUP); // pins 3 and 4 can be connected to ground for me to interact with the chip
  pinMode(4, INPUT_PULLUP);
  for (int i = 0; i <= 40; i++) {
    analogWrite(LIGHT, i*5);
    delay(30);
  }
  delay(800);
  for (int i = 20; i > 0; i--) {
    analogWrite(LIGHT, i*10);
    delay(10);
  }
  digitalWrite(LIGHT, LOW);
  delay(300);
}

bool timestamp(unsigned long stamp, unsigned long wait) {
  // checks whether or not a number of milliseconds have passed since a timestamp
  bool pastStamp = (millis()-stamp >= wait);
  return pastStamp;
}

void water() {
  // waters the plant and stores the time of watering
  digitalWrite(PUMP, HIGH);
  delay(8000);
  digitalWrite(PUMP, LOW);
  lastWater = millis()-8000;
}
void setInterval(int dm) {
  // sets the interval at which the plant should be watered and displays the new interval on the LED
  waterMode = (waterMode+dm)%5;
  interval = intervals[waterMode];
  if (waterMode == 0) {
    for (int i = 20; i >= 0; i--) {
      analogWrite(LIGHT, i*10);
      delay(30);
    }
  } else if(waterMode == 1) {
    digitalWrite(LIGHT, HIGH);
    delay(250);
    digitalWrite(LIGHT, LOW);
  }
  else {
    for (int i = 0; i < (waterMode-1); i++) {
      digitalWrite(LIGHT, HIGH);
      delay(900);
      digitalWrite(LIGHT, LOW);
      delay(100);
    }
  }
  lastAdjust = millis(); // stores adjustment time so the plant won't be watered immediately
}

void loop() {
  if (tested) // once the three test pumps are done
  {
    if (timestamp(lastAdjust, 2000)) {
      if (!digitalRead(3) && waterMode < 4)
        setInterval(1); // can water less frequently
      else if (!digitalRead(4) && waterMode > 0)
        setInterval(-1); // or more frequently
    }
    if (waterMode != 0
      && timestamp(lastWater, interval*HALFWEEK)
      && timestamp(lastAdjust, 60000)) {
          water();
    }
  } else {
    analogWrite(LIGHT, 90*sin(millis()*0.001)+100);
    if (!digitalRead(3)) { // only start test when I connect pin 3 to ground
      digitalWrite(LIGHT, HIGH);
      for (int i = 0; i < 3; i++) {
        // send three test water pumps to make sure the tube is at the right height
        digitalWrite(PUMP, HIGH);
        delay(2000);
        digitalWrite(PUMP, LOW);
        delay(4000);
      }
      digitalWrite(LIGHT, LOW);
      lastWater = millis()-4000;
      tested = true;
    }
  }
}

Circuit

Once the code and pump were working, I was ready to solder everything onto a circuit board. I took the chip off the Arduino and put all my components on a prototype board.

The circuit board with some components added

The board was arranged in a regular grid, with the components neatly laid out. It was a beautiful aesthetic, but in many ways it was the antithesis of wabi-sabi. Rather than displaying nature and imperfection, this circuit board was the result of precise manufacturing. I thought it might be interesting to have the board next to a plant with all its natural imperfections.

To add the pump, I first soldered the pump wires to some more wires so I could move it farther away from the circuit board. After that I started soldering the actual circuit. It didn't go entirely smoothly: my wire cutter couldn't come close enough to trim the leads properly, and I almost burned my finger fixing the connections on the transistor. But I was able to get it done in the end, and when I hooked it up to power from the Arduino, the green LED lit up.

Plant

Satisfied that my circuit was working, I went to get a plant. I wanted one that looked interesting but didn't need too much water, so Aloe vera seemed like a good choice. There were rows and rows of Aloe plants at Home Depot, all virtually identical. Yet one was very different from the rest, because it would become mine for the foreseeable future. I just had to find which one.

The Aloe plant I picked

It would be nice to say I picked one that spoke to me, but they really looked about the same. I picked the one that I thought would fit best. It wasn't perfect: some leaves were straight up, some were sticking out, a few had small brown spots. But the others weren't perfect either. This one was a good size to fit nicely and still command attention, so I got it.

Batteries

Before adding the battery holder, I connected my circuit to power from Arduino to make sure it was still working. Of course, it wasn't working.

To find out the cause, I built a makeshift multimeter with Arduino to see if the connections were good. By this method, I found the LED was bad, and the wire next to it was fine. Then I tried applying power to the LED to see where it stopped working. By this method, I found the LED was fine, and the wire next to it was bad.

At first, I thought physics was broken. There was no way it made sense to have such a contradiction. Eventually, I found the problem was in the connection between the LED and the wire, so I added extra solder and it worked again.

Now I was ready to add the batteries. Along the way, I soldered the positive end in the wrong place at least twice. I had the negative end at the top of the board when I wanted it at the bottom. The LED lit up, showing me that the circuit was working; but every time I tried to send test pumps, it rebooted. Connecting the reset pin to the VCC pin helped a little, but it still wasn't working.

I spent hours trying to fix that problem only to find it was a problem I had faced earlier with Arduino: I needed a capacitor connecting to ground. I soldered in the capacitor and it finally worked.

My final master plan

I tested it on a glass of water to be sure. I must admit, it took me a few tries to get the tube at the right height, but it worked.

Wabi-Sabi

The Aloe definitely added some wabi-sabi to my room. It has natural dents and spots, but they are what give it character, almost to the point that I can't see myself picking any other from the rows and rows of similar plants at Home Depot. I love the way the leaves are irregular in shape and size, but the small spikes are arranged so regularly and neatly on all of them.

That's nice and all, but it was expected. What I didn't expect is how even the circuit has elements of wabi-sabi. It has random wires sticking out from where I couldn't cut the leads close enough. Some parts of the solder bridges are thicker than others, some parts are more spiky than others. All of it is arranged on that same even grid.

My circuit's small imperfections give it character, just like the Aloe. It represents the hard work I put in over the course of a week for what I expected to be a one-day project. I hope it continues to work, because it's not consumer-ready quality. The connections are handmade rather than professional PCB connections. It doesn't have a box hiding the electronic parts, so consumers can't pretend it's just a magic box. But it's an interesting wabi-sabi decorative piece in its own right, one that represents my interests and learnings.

Shipwreck — How to almost build a remote control sailboat using Arduino
Magic Pencil — An attempt to type on Desmos online without touching the computer

Shipwreck

2022-06-01T16:38:07Z

Lots of people make remote control motorboats. Remote control sailboats are less common, although there is such a thing in Central Park. I thought I would make my own remote control sailboat using Arduino. It almost worked. Here is a video of the boat at her peak:

Unfortunately, the boat quickly declined after that. Here is how to almost make a remote control sailboat using Arduino.

Structure

The ill-fated hull

The mast and sail

Materials:

Plastic box
2 empty plastic bottles
2 plastic forks
Piece of paper with holes
Scissors
String
Tape

For the hull, I went with a catamaran structure: I used two plastic bottles for added stability. Feeling confident, I connected the bottles to a plastic box using nothing but tape. I wish I had stronger tape. (I tried school glue as well, but it kept coming apart.)

Next came the mast and sail. For the mast, I used two plastic forks taped together. For the sail, I cut a piece of paper with physics on it. I tied the sail to the mast using bowline knots and then added tape for extra security. (My boat probably relied a little too heavily on tape.) Finally, I cut a small hole in the corner to allow for tightening and loosening the sail later on.

I attached the mast and sail to the hull with yet more tape as well as some more string.

Now I was ready to see if my vessel was seaworthy. With no captain and no ability to steer, I successfully set sail.

The next logical step was to add a captain and an ability to steer. That's where the Arduino came in.

Control

The control circuit

Adding a tiller (fork) for steering and a mainsheet (string) for controlling the sail

Materials:

Arduino
Infrared remote
Infrared receiver module
DC motor
Servo motor

My plan was to control the steering using a servo motor (which can set its angular position to any value 0–180 degrees) and a regular DC motor (which just spins). I would attach the servo motor to a tiller to steer and attach the DC motor to the sail with a rope so I could tighten and loosen the sail. I connected the DC motor directly to two of the Arduino's digital pins, set as outputs, so I could control which way it spun using code.

Circuit diagram

Made with Tinkercad

Code

#include <IRremote.h>
#include <Servo.h>

#define SERVO 4
#define IR 7
#define MOTOR_1 8
#define MOTOR_2 9
IRrecv ir(IR);
Servo servo;

int pos;

void setup() {
  ir.enableIRIn();
  ir.blink13(true);
  pinMode(MOTOR_1, OUTPUT);
  pinMode(MOTOR_2, OUTPUT);
  servo.attach(SERVO);
  pos = 90;
  servo.write(90);
}

void loop() {
  if (ir.decode()) {
    int button = ir.decodedIRData.command;
    /*
      * up: 9
      * down: 7
      * vol-: 21
      * vol+: 70
      * eq: 25
      */
    if (button == 9) {
      digitalWrite(MOTOR_1, LOW);
      digitalWrite(MOTOR_2, HIGH);
      delay(100);
      digitalWrite(MOTOR_2, LOW);
      delay(100);
    }
    if (button == 7) {
      digitalWrite(MOTOR_1, HIGH);
      digitalWrite(MOTOR_2, LOW);
      delay(100);
      digitalWrite(MOTOR_1, LOW);
      delay(100);
    }
    if (button == 21) {
      pos -= 30;
      if (pos < 0) {
        pos = 0;
        servo.write(0);
      } else {
        for (int i = 29; i >= 0; i--) {
          servo.write(pos+i);
          delay(15);
        }
      }
      delay(100);
    }
    if (button == 70) {
      pos += 30;
      if (pos > 180) {
        pos = 180;
        servo.write(180);
      } else {
        for (int i = -29; i <= 0; i++) {
          servo.write(pos+i);
          delay(15);
        }
      }
      delay(100);
    }
    if (button == 25) {
      pos = 90;
      servo.write(pos);
      delay(200);
    }
    ir.resume();
  }
}

Ship with a Skipper

Now that I had a control system and a supposedly seaworthy vessel, I just had to integrate the two. I tied and taped a string around the DC motor, while I tied and taped a plastic fork to the servo motor.

I flipped the servo motor upside down so the tiller would be in the water, and I tied the mainsheet to the sail. Of course, I secured everything with tape. Finally, I added a 9V battery. With the Arduino, my ship now had a skipper to lead a crew of motors.

However, when I tested it, I found the controls were extremely glitchy. When I pressed a button, it would hardly ever work.

So I began another set of improvements, not knowing it would be my last. I tried compensating for the glitches by programming the Arduino to move further with each button press. I also broke off the fork part of the tiller and pushed it into the bottom for extra steering power. Finally, I changed the angle of the breadboard and moved things around so the infrared sensor would be higher up, and signals would be less likely to get blocked.

The final design

I tested the new control system on land and it worked decently, though it wasn't great.

Shipwreck

I was confused about why the controls seemed to work so much better on land than on water. I was also confused about why the controls were working quite poorly either way. But rather than find the answer, I made excuses like “the sun's infrared rays are interfering” and went on another voyage. As soon as I left the harbor and set the ship on the water, I felt something wrong. The ship was quickly tilting to the right.

My reflexes kicked in and I immediately grabbed the boat to protect the components inside. Unfortunately, there was still a little splashing. The sail and the battery were completely ruined, and I was worried about the rest of the components as well, especially the Arduino.

Electric shipwreck

It could have been far worse: the wreck occurred close to shore, and my reflexes were quick enough that damage was minimal. Still, it meant I could go no further with the project. I took everything inside, removed all the tapes and knots, and tested the electronic components to see if they were still working. Fortunately, there was no damage as far as I could tell.

Then I did what I should have done before the shipwreck: I found out why the controls weren't working. The remote seems to have a very small range, only about a foot. Clearly the infrared hardware was designed for things like turning LEDs on and off, not sailing a small vessel across a swimming pool through 10 knot winds.

Lessons Learned

A chain is only as strong as its weakest link. This project had one weak link and one weaker link. First, the infrared remote range wasn't enough to control the boat. This wasn't directly my fault, but I should have tested the hardware more before assuming it would work for my needs.

The weaker link was a literal link between the plastic bottles and the plastic box. I just used a couple pieces of tape, but this is perhaps the most important link in the entire ship.

Without these two issues, I might have actually succeeded, although there were some more minor flaws. The paper sail was attached in a way that it was bent to one side. I should have attached it more carefully or used a boom (solid bar at the bottom of the sail) to keep it straight. The plastic fork tiller was probably too thin. Using thicker rope might have been more secure and made the knots easier to tie.

From a more general perspective, this project taught me about taking risks. I took a risk by using electronic components near water. I paid for that risk. But even though I didn't succeed, I don't regret it. I took a risk hoping it would work out, but prepared for the chance that it might not. In the end, it almost did work out, and I only lost a battery.

At the same time, my risk should have been more calculated. Although I was safe because the battery was only 9V, I knew I was putting my components in danger. I should have tested the strength of the hull and the range of the remote beforehand.

There will probably be many more times in the future where I'll try a project that might fail. Hopefully I'll be more careful next time; but even if I can't completely eliminate the risk, it may or may not still be worth taking a chance.

And most of all, I learned that tape can't solve everything.

Wabi-Sabi — The making of an automatic plant watering system
Magic Pencil — An attempt to type on Desmos online without touching the computer

Ortatious Andith ib Staylatt Neller

2022-03-01T16:38:07Z

There are a few videos on the internet about what English sounds like to people who don't speak it. They often sound something like this:

Stringe canna is like a string paggard hasheter ominent if you think diadout forday and comminent paymin for the trainable.

In other words, the videos make no sense, and they shouldn't — but it always annoys me how you can make out a lot of actual English words, even if it's gibberish. That got me thinking about a way to generate fake English with a computer and avoid actual English words.

I thought of two main ways to do this: manually coding linguistic rules and using a neural network. On this page, I will go through both of these and compare the results. With that, let's start — or as they say in faux English, brind shass!

Hard-Coding Phonotactics

When I first tried to hard-code English, I had trouble finding the resources I needed. In particular, I couldn't find a source on English phonotactics that was detailed enough to write as a computer program. Phonotactics deals with how sounds are allowed to combine in a language: for example, “treels” is a valid combination of sounds in English, even though it isn't in the dictionary. “Gvprtskvni” is definitely not a valid combination in English, but believe it or not, it is an actual Georgian word.

გვფრცქვნი

Gvprtskvni: it means “you peel us” in Georgian.

Fortunately, there is a very well-known and detailed source on Japanese phonotactics: Japanese writing. I first wrote a program to create fake Japanese. This is how it works:

List all sounds that exist in Japanese.
Make rules for all the ways those sounds can combine and make syllables.
Adjust to make some sounds more probable than others [1].
The program will read this data and output fake Japanese.
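As a rough illustration of those four steps, here is a minimal Python sketch. The sound inventory, the weights, and the single "optional consonant + vowel + optional n" syllable rule are simplified assumptions made up for this example, not the data or rules from my actual program. The sketch also ignores real details such as long vowels, doubled consonants, and spellings like "shi" or "tsu".

```python
import random

# Hypothetical, heavily simplified inventory: (sound, relative weight).
ONSETS = [("", 3), ("k", 3), ("s", 2), ("t", 2), ("n", 2), ("h", 1), ("m", 1), ("r", 2)]
VOWELS = [("a", 3), ("i", 3), ("u", 2), ("e", 1), ("o", 2)]
CODAS = [("", 8), ("n", 2)]  # most syllables are open; some end in "n"


def pick(weighted, rng):
    """Choose one sound with probability proportional to its weight."""
    sounds = [sound for sound, _ in weighted]
    weights = [weight for _, weight in weighted]
    return rng.choices(sounds, weights=weights, k=1)[0]


def fake_syllable(rng):
    """Build one syllable as optional onset + vowel + optional coda."""
    return pick(ONSETS, rng) + pick(VOWELS, rng) + pick(CODAS, rng)


def fake_word(rng, min_syllables=1, max_syllables=4):
    """Join a random number of syllables into one word."""
    count = rng.randint(min_syllables, max_syllables)
    return "".join(fake_syllable(rng) for _ in range(count))


def fake_sentence(rng, words=5):
    """Join several fake words into a capitalized sentence."""
    return " ".join(fake_word(rng) for _ in range(words)).capitalize() + "."


if __name__ == "__main__":
    print(fake_sentence(random.Random(0)))
```

Keeping the weights in plain lists means the third step (making some sounds more probable than others) is only a matter of editing numbers, not code.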

As a non-Japanese speaker I thought the result was fairly convincing:

/oːri ribonu ku jokin roːpːuwaː. tɕin kisuoko beon ehoːn ri./
おうりりぼぬくよきんろうっぷわあ。ちんきすおこべおんえほうんり。
Ōri ribonu ku yokin rōppuwā. Chin kisuoko beon ehōn ri.

Japanese phonotactics is a lot simpler than English phonotactics though, so I couldn't go straight to English. I tried various languages along the way, referencing Wikipedia phonology pages, but I was a little less satisfied with the results. Note that in the following examples, I had to hand-pick words from the program that sounded best and make some of my own edits instead of taking the results as-is. In Mandarin Chinese, a lot of the syllables the program gave were actual characters. In the other languages, I sometimes had to manually remove real words. Here's (partially) fake Mandarin Chinese:

/nau̯¹i⁴ rɨpjɛn⁴ ʈʂʰaŋ²tɕy² pʰaŋmi ʂan¹mwən¹. tsʰau̯tɕʰiŋ³ lwai̯²tsʰɨ tsən¹sɨ tsjaŋ¹lwan tswai̯⁴./
Nāoyì ripiàn chángjú pangmi shānmūn. Caoqǐng luáici zēnsi ziāngluan zuài.

Here's fake Spanish:

/buˈɲasa ˈela fiˈneɾo pɾiˈnifɾio ʝuˈsela. ˈtʃeɲa luˈposa ˈdjenos setiˈgio raˈdɾaske./
Buñasa ela finero prinífrio llusela. Cheña luposa dienos setiguío radrasque.

Fake French:

/vœ̃ bənɛ̃ lœʒ ʁɔ̃ adɔ̃pan. sekɑ̃ ʒaʁi tuvəble kɥɛl bʁadlyʒe./
Veun benain lœuge ron adompanne. Sécan jarrie touvebler cuelle bradlugée.

Fake Hindi-Urdu:

/kəfmi ɖʱape kʰənːã dənam dʒʊbai. ətʃʰ ɦɪnɔli sɛxa ʈamba ɦoɽʱ./
कफ़मी ढापे खन्नाँ दनाम जुबाई। अछ हिनौली सैख़ा टांबा होढ़।
کَفمی ڈھاپے کھَنّاں دَنام جُبائی۔ اَچھ ہِنَولی سَیخا ٹامبا ہوڑھ۔
Kafmi ḍhaapay khannã danaam jubaai. Achh hinauli saikha ṭaamba hoṛh.

For fake Arabic, I had to include the actual Arabic definite article because it's such a distinctive part of the sound:

/az-zaːtiːɣaː ʕabajb al-kawki radʒr qajziraː. ʃawaː al-majsˤawtaː θun ʕaːxaðˤat al-qaʕasaː./
الزَّاتِيغَا عَبَيب الْکَوکِ رَجْر قَيزِرَا۔ شَوَا الْمَيصَوتَا ثُن عَاخَظَت الْقَعَسَا۔
Az-zātīghā ʻabayb al-kawki rajr qayzirā. Shawā al-mayṣawtā thun ʻākhaẓat al-qaʻasā.

And finally, fake English:

/snaɪʃoʊz bɪp spɛbi jutɛɪθ hoɪbraɪ. hædneɪ sɒtməwɪdʒ sturə rutʃ ɒskeɪl./
Snyshows bip spebby uteith hoibrigh. Hadnay sotmawidge stoora rooch osscale.
Sny-shohs bip speh-bee yoo-tayth hoi-bry. Had-nay sot-muh-wij stoo-ruh rooch oss-kayl.

Here's an unedited fake English sample to give you an idea of my edits:

/peɪʒi eɪ ni kju leɪ. aʊəaɪk deɪ lɒtʃeɪ snaʊʒibʌs toʊt./
Peigee ay knee cue lay. Owa-ike day lochay snowgebus tote.
Pay-zhee ay nee kyoo lay. Ow-uh-ike day law-chay snahw-zhee-bus toht.

Knee, cue, lay, day, and tote are all actual English words, and the rest of the words somehow seem off to me.

GAN

Next I tried to accomplish the same task using a neural network. My first choice was a generative adversarial network (GAN). If you feed in some data (like pictures of people, or English words) the network should return more generated examples of what you put in. It works in two parts: a generator and a discriminator which compete with each other (hence adversarial). The discriminator tries to discriminate between samples made by the generator and real data, while the generator tries to generate fakes that fool the discriminator. In the end, the generator should create such realistic examples that the discriminator can't tell what's real and what's not.

Sample outputs by the authors of StyleGAN. The people in these pictures are computer-generated, not real!

Unfortunately, this didn't work so well for me. Image GANs often use a convolutional neural network. This allows the discriminator to average together nearby pixels in a way that detects things like edges. I wanted my network to consider nearby letters: “tha” is a lot more likely than “gvp”. But what's G times 0.24? Since I was using letters instead of numbers, it didn't make sense.

If not convolution, I couldn't think of a reasonable way to consider nearby letters. I ended up using only Dense layers, one of the most basic types of layer in a neural network, which was clearly not good enough. After a lot of training, with my computer fans going crazy, I got:

??? ? cda c ??
???? ???uhe ?? ??
? ?? a? ggddfi ?? ?
????? ?c bacad ?? ?
?? a?kafejdmd??
a? ? ?l?olmsha?? ?
?????? ??ig?a ?? ??
? ?? bdnhn? a????
??? ??b?bbd b??? ?
???? bahrh?a??? ?

The question marks are where the neural network did not put letter #1 (A), or letter #2 (B), but something like letter #-4 or letter #49. There is a little bit of learning here: it is learning that words go in the middle, with spaces on either side. But I don't really think that's what English sounds like to people who don't speak it.

LSTM

Instead of using a GAN, I found an example online that used a Long Short-Term Memory (LSTM) network to generate text character-by-character [2]. LSTM networks are a type of recurrent neural network (RNN), meaning they can remember earlier inputs in order to put their output in context [3]. This was exactly what I needed to generate fake English, because each letter depends on the letters around it: you can't have a word like “hdjafhkjsdjfh”.
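To make "each letter depends on the letters around it" concrete, here is a sketch of how training examples for a character-level model can be built: a window of preceding characters is paired with the character that follows it. The function name and the `window` and `step` parameters are my own choices for this illustration, and the window length is a free parameter. This is not the code from the Keras example.

```python
def make_training_pairs(text, window=10, step=1):
    """Return (context, next_char) pairs for a character-level model."""
    pairs = []
    # Stop early enough that there is always a character after the window.
    for start in range(0, len(text) - window, step):
        context = text[start:start + window]
        target = text[start + window]
        pairs.append((context, target))
    return pairs


if __name__ == "__main__":
    sample = "all human beings are born free"
    for context, target in make_training_pairs(sample, window=10)[:3]:
        print(repr(context), "->", repr(target))
```

The network is then trained to predict `target` from `context`. To generate text, it predicts one character, slides the window forward by one, and repeats.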

While the original code used segments of 40 characters, I shortened it to 10 characters so it would be focused on generating words rather than sentences. I used the Universal Declaration of Human Rights [4] as input, first with the original text, then with a version transcribed into IPA phonetic characters. At first, the output didn't make much sense:

tdofend zulsance ins of dier ousshand ongantiinco und cous ariasdacpimcoancianl ancor toroglneibais asrarit, cacius or ankecmembecoousion ancen ortdas or macalitdgedd antecpudialitras artticevafs almideacicsiondkssmend actetjone irpoco ho erect pas biymonaitovinagiry alicof onyohperian an onde nceroncaed armandaes of eraqmane anuitaese aaleetandyicvetion the alterangererligeceocicaliandemtityarta

As the neural network learned more about the data, it started coming together:

furthermore, no forth other other bergenuin, by semplesng of las1 his the digrits to hid ortatious in lioged int inservational, touldiag in as chongrald themevees for the united nationaly in a pecils deace or the a the gnien of perserventied. irtist inciperhas arcaring of nock, huraliple for the bidivation social declaration or by law. article 24 everyone has the dignity and social protection of whis ded

At this point, there are some parts that make no sense (“las1”), some parts that sound like possible English words (“inservational”), and some actual English words (“social declaration”). When there are parts that make no sense, the model is clearly underfitting: it isn't able to match the data well enough. But when there are actual English words, the model is overfitting: it is making decisions based on overly specific data points rather than overall patterns in the data. For my purposes, I want to have words that sound possible but aren't real, like “bidivation”.

The green line represents an overfitted model and the black line represents a regularized model. While the green line best follows the training data, it is too dependent on that data and it is likely to have a higher error rate on new unseen data, compared to the black line. (Chabacano, Wikimedia, CC BY-SA 4.0)

The overfitting is clear in this case, where the neural network is memorizing the data. But outside of a machine learning context, the whole question of what English sounds like to people who don't speak it is closely connected to overfitting. My complaint about other attempts to answer it is that they used many actual English words, just as my neural network ended up doing. Even my hard-coded phonotactic program had a few English words.

In addition to fake English words, the neural network generated some questionable remarks on human rights. For a network trained entirely on a declaration of human rights, this really highlights the importance of AI ethics...

as to marriage, during marriage shall be subjected to torture or to cruel, indushis s.
everyone shall be held in slavery or servitule
peoples of territories... shall be subject to arbitrary arceslation
everyone has the right to seek... and shall be held guilty of any penal offence
the moral and education shall be arbitrarily deprived of his country, includes freedoms, themselves and among the people
everyone has the right to equal pay for equal write... and the slave trade

Between these dystopian sentences, and the overfitting and underfitting in the model, this RNN showed how human input can still be an important supplement to AI. The same applied to my earlier program, where I had to do some manual edits on top of the computer-generated text. Maybe figuring out what English sounds like to people who don't speak it is just one of the many cases where collaboration between human and computer is the best approach.

References

The GitHub for this project is at https://github.com/crackalamoo/staylatt.

[1] Frequency of occurrence for units of phonemes, morae, and syllables appearing in a lexical corpus of a Japanese newspaper (Katsuo Tamaoka & Shogo Makioka)
[2] Character-level text generation with LSTM (Keras)
[3] A Gentle Introduction to Long Short-Term Memory Networks by the Experts (Jason Brownlee, Machine Learning Mastery)
[4] Universal Declaration of Human Rights (United Nations)

2+3=4

2022-02-01T16:38:07Z

I once saw a math joke something along the lines of “if Joe has ten apples in his left hand, and twelve apples in his right hand, what does Joe have in total?” Answer: “really big hands.” While this seems ridiculous, it's definitely not wrong, and I would argue it's even more clever and insightful than it seems at first.

In fact, you could construct an entire self-consistent mathematical system to describe the number of apples Joe can have in his hands. Say Joe can only hold a total of four apples at a time. Then we could write $2 + 3 = 4$ , because if there were 5 apples, one would just fall. This is in fact valid mathematics within this mathematical framework specifically designed to model the apples in Joe's hands.

What does this mean? First of all, it means I have just shown an example of a mathematical framework where $2 + 3 = 4$ is a completely valid statement. This isn't so strange: modular arithmetic is accepted and widely used, and there you can have $2 + 3 \equiv 1 \pmod{4}$.
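Here is a small Python sketch of both kinds of addition, to show that "Joe arithmetic" is perfectly well defined. The names `joe_add` and `mod_add` and the capacity of four apples are just labels for this illustration.

```python
def joe_add(a, b, capacity=4):
    """Addition where any apples beyond `capacity` fall to the ground."""
    return min(a + b, capacity)


def mod_add(a, b, modulus=4):
    """Ordinary modular addition, for comparison."""
    return (a + b) % modulus


if __name__ == "__main__":
    print(joe_add(2, 3))  # 4
    print(mod_add(2, 3))  # 1
```

For non-negative whole numbers of apples, `joe_add` is still commutative and associative, so it doesn't matter in what order Joe picks the apples up. It simply stops counting at four.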

Second, this tells us something about the relationship between math and the real world. Sometimes people wonder how math describes the physical world so accurately. I am going to argue that this is because math was designed to do this, at least to some extent.

The Universe Can't Do Math

Plants can't talk, but we can talk about plants. The universe can't do math, but we can use math to describe the universe. I'm going to show some examples of this idea.

Bacterial Populations

At least during the log phase, bacterial populations grow exponentially [1]. As an equation, that looks like $P = C e^{k t}$. How is this possible? Do the bacteria conduct regular censuses and hold centralized business meetings, making sure to divide at such a rate that the population grows exponentially? Of course that's ridiculous. The underlying cause is that each bacterial cell produces about the same number of offspring in a given amount of time. And the underlying cause of that, in turn, has to do with how long it takes for the bacteria to collect all the things they need to divide in the cell cycle.

A bacterial business meeting to ensure exponential population growth (taken from a transmission electron microscope)

If each bacterial cell produces the same number of offspring in a given amount of time, what does that look like as an equation? It's $\frac{d P}{d t} = k P .$ And if you solve that equation, you get the exponential growth equation above. So it's not that the bacteria do math: instead, the physical world places certain constraints on how the bacteria will behave. Based on those constraints, we can logically deduce what will happen to the bacteria. The way we deduce that is called mathematics.
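For completeness, the step from the differential equation to the exponential is a separation of variables. Here I assume $P > 0$, and $c$ is just the constant of integration:

```latex
\frac{dP}{dt} = kP
\;\Longrightarrow\; \frac{dP}{P} = k\,dt
\;\Longrightarrow\; \ln P = kt + c
\;\Longrightarrow\; P = e^{c}\,e^{kt} = C e^{kt},
\qquad C = e^{c} = P(0).
```

So the constant $C$ is simply the population at $t = 0$.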

Coulomb's Law

Newton's well-known law of gravitation states $F = \frac{G m_{1} m_{2}}{r^{2}}$ to describe the force of gravity between two masses. However, since gravity is weird and I do not have a theory of everything, I'll use the similar Coulomb's law from electromagnetism: $F = \frac{k q_{1} q_{2}}{r^{2}} .$ This law is quite surprising. Why should the electric force between two charges be directly proportional to $1 / r^{2}$ ? Do electrons have little calculators and rulers that we can't see, making complex measurements every instant in order to obey this law? Do they take into account all the charges in the entire universe when deciding where to go in an instant?

Coulomb's law applied to the attractive force between a nucleus and an electron. (CC-BY-NC-SA; Kathryn Haas)

Obviously these little electron masterminds are unrealistic. Another theory is to have electric field vectors coming out of charges and spreading out. That way instead of measuring distances to every charge in the entire universe every instant, charges only have to look at the electric field where they are. With Gauss's law, this actually gives the same $1 / r^{2}$. Why? Because the surface area of a sphere is proportional to $r^{2}$.

Derivation of Coulomb's law from Gauss's law

Gauss's law states

$\oint E \cdot d A = \frac{Q_{e n c}}{ϵ_{0}}$

Consider a spherical surface of radius $r$ centered on a charge $Q_{e n c}$. If there are no other charges, the electric field should have the same magnitude all around the surface, and should always be perpendicular to the surface. Therefore $E \cdot d A = E d A$, where $E$ is the magnitude of the electric field.

$\oint E \cdot d A = \oint E d A = E A = \frac{Q_{e n c}}{ϵ_{0}}$

Since the surface is a sphere, $A = 4 π r^{2}$. Also note that the $k$ in Coulomb's law equals $1 / (4 π ϵ_{0})$.

$E = \frac{Q_{e n c}}{4 π r^{2} ϵ_{0}} = \frac{k Q_{e n c}}{r^{2}}$

Finally, the magnitude of the electric force on a charge $q_{2}$ at distance $r$ equals the magnitude of the electric field times $q_{2}$. If we say $Q_{e n c} = q_{1}$, that gives

$F = \frac{k q_{1} q_{2}}{r^{2}}$

Let's go back to the big picture. It seemed at first that the universe somehow knew about inverse square laws and mathematics. In reality, this mysterious law comes down to the surface area of a sphere, and electrons have no mathematical abilities.

Conservation of Momentum

Conservation of momentum is a really interesting physical law. It was first used for classical physics, but it turns out to work for quantum physics too. So does the universe measure the momentum of each particle and add it up to make sure the total momentum is the same?

In fact, conservation of momentum takes place on a smaller scale. You can show that interactions between objects always conserve momentum, and therefore, the momentum of the universe is conserved. One way to do that is with Newton's second and third laws: $F = d p / d t$ and $F_{A B} = - F_{B A}$
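As a short worked version of that argument, take two objects $A$ and $B$ that interact only with each other, and read $F_{AB}$ as the force on $A$ due to $B$ (that reading of the subscripts is my assumption here):

```latex
\frac{dp_A}{dt} = F_{AB}, \qquad
\frac{dp_B}{dt} = F_{BA} = -F_{AB}
\;\Longrightarrow\;
\frac{d}{dt}\left(p_A + p_B\right) = F_{AB} - F_{AB} = 0 .
```

The total $p_A + p_B$ never changes, and since every interaction in the universe can be broken into pairs like this, the same holds for the sum over all particles.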

But then how can you derive Newton's second and third laws? You can't: they are laws, in the language of math, that describe the physical universe. It's a lot like the differential equation for bacterial populations: mathematics is a tool made to describe the universe, which explains why the universe seems to run on mathematics.

Axioms

If mathematics is just a tool we use to describe the universe, does that make it invented? I think the idea of axioms is helpful here. Euclid proposed five axioms that are at the base of his geometry [2], but other forms of geometry use different axioms and come to different conclusions [3].

If these geometries have different conclusions, how do we know which geometry is correct? Here comes the cop-out answer: it depends. If you are doing geometry on a plane, Euclidean geometry is correct. If you are doing geometry on a sphere, spherical geometry is correct.

But there's another way to find out which geometry is correct. Just check the axioms, and see if they apply to whatever thing you're looking at. This technique should work for anything, not just geometry, as long as the math is valid.

Let's go back to Joe and his apples. Conventional math tells us 2+3=5, but if Joe has 2 apples and we add 3, we find that Joe has only 4 apples since one fell out of his hands. What's the problem? Is math wrong? No, it's just that a fact of conventional mathematics does not apply to the case of Joe and his apples: specifically, the idea that the set of counting numbers is infinite. Because of this, we need to define addition in a different way than in conventional mathematics.

Mathematicians define addition in different ways all the time. Just look at linear algebra: you start with a set of axioms and a definition for addition. If whatever real-world thing you are studying follows all the axioms of linear algebra, then you can use all the methods of linear algebra which have been developed by mathematicians over the years. Some of this conflicts with normal mathematics: for example, $A B - B A$ is not necessarily 0 when you work with matrices.
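As a quick check of that last claim, here is a small Python sketch with two $2 \times 2$ matrices. The particular matrices are just an example I picked, and the helper functions are written out by hand so the snippet needs no libraries.

```python
def matmul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matsub(X, Y):
    """Subtract matrix Y from matrix X entry by entry."""
    return [[x - y for x, y in zip(row_x, row_y)] for row_x, row_y in zip(X, Y)]


A = [[0, 1],
     [0, 0]]
B = [[0, 0],
     [1, 0]]

commutator = matsub(matmul(A, B), matmul(B, A))
print(commutator)  # [[1, 0], [0, -1]]
```

With ordinary numbers the same expression would always be zero, which is exactly the rule that stops applying once the axioms change.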

Richard Feynman noted that in physics, unlike in mathematics, starting with fundamental axioms isn't always the best way to do things [4]. Why is that? I think it's because we need to start with the physical world, then see what axioms it follows, and finally do math assuming those axioms are true.

I think we need to reframe our idea of axioms. Instead of a fundamental truth, axioms are a starting place from which you do further reasoning. If you ever find yourself at the same starting place as some branch of mathematics, you can use the reasoning from that part of mathematics.

So is mathematics discovered or invented? It's hard to tell. I would say that the axioms, the starting points of mathematics, are observed in nature or just invented. From there, the reasoning we do is a process of discovery, not invention.

Finally, I want to emphasize the interplay between reasoning and observation. Mathematics is the process of first picking axioms that appear to be true based on the physical world, and then doing numerical reasoning from there. The first part explains why the physical world seems to follow mathematics: it's the other way around. But that doesn't make mathematics lesser in any way, because the second part is what lets us make accurate predictions about the real world. Reasoning is what makes mathematics so powerful.

Connecting Covariance and Rotational Inertia — A surprisingly deep connection between statistics and classical mechanics
Speedrunning Calculus — I did a “speedrun” of my Calculus 3 class.
Romans Needed Quantum Computers — A history of information density, from the ancient Mayans to quantum computing

References

[1] Growth of bacterial populations (Britannica)
[2] Axioms and Postulates of Euclid (The Elements of Euclid translated by Sir Thomas Heath)
[3] The Three Geometries (EscherMath)
[4] The Relation of Mathematics to Physics (Richard Feynman, YouTube)

Mathematical History

2021-12-01T16:38:07Z

I found the following equation in a paper:

C = - \sum_{i = 1}^{N} P (i) \sum_{j = 1}^{N} P (j | i) \log P (j | i)

What was this paper about? Conditional probability? Entropy and statistical thermodynamics? Some other field of math or physics?

In fact, the paper was about deciphering the Indus script from thousands of years ago. But it did borrow ideas from both conditional probability and statistical thermodynamics. Such methods are actually quite common now in deciphering ancient scripts: this paper uses the same idea for symbols used in Iron Age Scotland.
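To get a feel for what the equation measures, here is a minimal Python sketch that estimates it from a sequence of symbols by counting adjacent pairs. The function name, the use of base-2 logarithms (so the result is in bits), and the toy sequences are my own assumptions for illustration, not the method or data from either paper.

```python
import math
from collections import Counter


def conditional_entropy(sequence):
    """Estimate C = -sum_i P(i) sum_j P(j|i) log2 P(j|i) from adjacent pairs."""
    pairs = list(zip(sequence, sequence[1:]))
    pair_counts = Counter(pairs)
    first_counts = Counter(first for first, _ in pairs)
    total = len(pairs)
    entropy = 0.0
    for (first, _second), count in pair_counts.items():
        p_first = first_counts[first] / total        # P(i)
        p_next_given_first = count / first_counts[first]  # P(j|i)
        entropy -= p_first * p_next_given_first * math.log2(p_next_given_first)
    return entropy


if __name__ == "__main__":
    print(conditional_entropy("abababab"))  # rigid order: the next symbol is certain
    print(conditional_entropy("aaba"))      # some freedom in what follows "a"
```

A sequence with a completely rigid order scores zero, while a sequence where any symbol can follow any other scores high. The idea is that a writing system for a language should land somewhere in between.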

The Indus script

When you think about applied mathematics, history is probably not the next field that comes to mind. You might think of physics, computer science, engineering, or chemistry.

But these days, it feels like math is extending into every field. You can't learn economics without supply and demand curves. Finance and psychology are full of data and statistical analysis. There's an entire subfield of mathematical biology as our understanding of biological systems improves. So why not mathematical history?

History is interesting in itself, but it also has a lot of questions that are useful and practical. What can Easter Island teach us about the effect of population size and resource use on a society? What does the Industrial Revolution mean for how new technology can reduce poverty? How does the Islamic Golden Age relate to global scientific collaboration and exchange of ideas in the Information Age?

These are just a few important questions from history that math, especially statistics, can help us to answer more thoroughly.

Small Example Question

This is the kind of question I imagine you might find for homework in a mathematical history class.

Collapse of States

You are given the durations of various empires in history [1].

Use software to construct a probability distribution for these durations. What type of distribution is it?
Based on this data alone, given that the United States has lasted for about 250 years since 1776, in which year is the United States expected to collapse?
Give one historical or statistical reason why this probability distribution may give a biased estimate for the collapse of the United States.

I took the data into Python and plotted it using matplotlib. Here's the histogram I got.

Histogram for number of years an empire will last

To me, this looks like an exponential distribution. We can interpret this as saying that the probability of an empire collapsing in any given year is constant, no matter how long the empire has already lasted, similar to radioactive decay. This is a meaningful historical interpretation that comes straight out of a statistical model.

Now, in principle, we can evaluate the following ratio of integrals to determine the year $Y$ in which the United States is expected to collapse, given that it has already lasted 250 years. The denominator is the probability of lasting at least 250 years, which turns the numerator into a conditional average.

$Y = 1776 + \frac{\int_{250}^{\infty} k t e^{- k t} d t}{\int_{250}^{\infty} k e^{- k t} d t}$

Take a moment to think about how strange that is: an integral telling you when the United States will collapse! However, since the data is a finite list of durations, it's easier to just analyze the data directly than to try to extract a value of $k$. We can simply look at all the empires that lasted at least 250 years and find the average duration of those. This gives 550 years. So the year in which the United States is expected to collapse based on our model is 1776 + 550 = 2326.
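The "analyze the data directly" step can be sketched in a few lines of Python. The durations below are made-up placeholder numbers, not the real list of empires, and the function name is hypothetical. The point is only to show the filtering and averaging.

```python
def expected_total_duration(durations, survived_so_far):
    """Average duration of the empires that lasted at least `survived_so_far` years."""
    survivors = [d for d in durations if d >= survived_so_far]
    if not survivors:
        raise ValueError("no empire in the data lasted that long")
    return sum(survivors) / len(survivors)


# Hypothetical durations in years, NOT the real data set.
toy_durations = [50, 120, 200, 300, 400, 800]

if __name__ == "__main__":
    mean_duration = expected_total_duration(toy_durations, 250)
    print(1776 + round(mean_duration))
```

Only the empires that made it past 250 years enter the average, which is the data-driven counterpart of conditioning on the United States having already survived that long.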

Of course, this isn't a perfect model. One issue I can think of is that the list generally includes empires and dynasties, which might last shorter or longer than relatively democratic countries such as the United States. Another is that many of these empires were from long ago, and there might be characteristics of the modern era that make states last for more or less time.

Big Example Questions

I think there's a lot of potential for this idea of mathematical history. Here are some cases where that might be useful.

Industrial Revolution

The Industrial Revolution is a great example of the impact of technology on society. It had many positive impacts: poverty was greatly reduced, and new technology increased quality of life. On the other hand, it led to a global divide between industrialized and non-industrialized countries, and increased pollution which was bad for health and the environment.

The Industrial Revolution is neither the first nor the last instance of technology drastically changing society. A lot of the questions about the positive and negative effects can benefit from a mathematical approach: we might want to know if there was a statistically significant change in political systems or political stability in industrialized nations. Math can help quantify how even or uneven the reduction of poverty was both in individual societies and around the world.

We might also want to look at the Industrial Revolution in the context of other big changes in history, like the development of agriculture, and try to quantify these changes. We can also look at how frequent these changes are over time.

Imperialism and Global Influence

The best known and largest example of imperialism is European imperialism in the early modern period. However, there are a lot of examples of imperialism before that, like the Mongol empire. More recently, the soft power exercised by today's world powers arguably has some similarities.

It might be interesting to quantify the effects of global influence on both conquering and conquered nations. These effects would be functions of many variables (sounds like multivariable calculus) like the amount of direct or local rule, geographical distance between countries, or time period. Looking at these functions could tell us something about soft power today and the conflict between the US and China.

Future of Mathematical History

I don't think much thought has been given to the idea of mathematical history yet, so it'll take some time to develop a proper theory and apply it to important questions. But Isaac Asimov's idea of psychohistory from Foundation is similar. In the real world, some cool sources like Our World in Data have articles (like this one) that I would say fall under mathematical history. I'm curious to see how this idea develops in the future, and I think a lot of important results can come out of it.

Predicting Every Election since 1916 — How “election pundit predictions” betray a misunderstanding of probability

References

[1] List of empires (Wikipedia)

Entropic Evidence for Linguistic Structure in the Indus Script (RPN Rao et al., 2009)
Pictish symbols revealed as a written language through application of Shannon entropy (Lee, Jonathan & Ziman, 2010)

What is Schrödinger's Equation?

2021-11-01T16:38:07Z

In classical mechanics, arguably the most important equation is Newton's famous

$F = m a$

It's so simple, it almost feels silly putting it on a separate line in the middle of the screen. But it deserves the spot, because it's so central. With this equation, you can take information about a classical mechanics system and figure out how it will change for all time.

In quantum mechanics, there is a similar equation used to predict the behavior of quantum systems. It is Schrödinger's famous

i ℏ \frac{\partial}{\partial t} Ψ (r, t) = - \frac{ℏ^{2}}{2 m} \nabla^{2} Ψ (r, t) + V (r, t) Ψ (r, t)

Schrödinger is back row, sixth from the left (or right) in this legendary picture

Putting this on a separate line doesn't feel silly at all. Surprisingly, however, this equation isn't too far off from the principles of classical mechanics and $F = m a$ . Today I am going to show you how to understand this equation a little better using classical physics and lots of math. You will need classical mechanics and calculus, but no quantum background. There are just a few facts about quantum physics you need to accept first:

Quantum entities can have energy from their frequencies according to the equation $E = h f$ , where $E$ is energy, $f$ is frequency, and $h$ is a constant called Planck's constant.
The wavelength $λ$ of a quantum entity depends on its momentum $p$ according to the de Broglie equation, $λ = h / p$ . You can actually derive this quite easily from $E = h f$ and $E = m c^{2}$ .
With quantum mechanics, we don't look at the exact positions or velocities of particles. Instead, we use something called a wavefunction. With wavefunctions and some calculus, we can find the probability distributions of position, momentum, and more, but not certain values.

It's important to note that this does not represent a true derivation: it would be easier to just accept Schrödinger's equation directly rather than accept the above facts and go through all this math. However, hopefully this will give some intuition into Schrödinger's equation and the math behind it.

Classical Basis

You might say “Schrödinger's equation looks nothing like Newton's equation, how can they be analogous?” This is a fair point. In fact, Schrödinger's equation is a little more analogous to conservation of energy. (More precisely, it's based on a Hamiltonian, not a true conservation of energy equation.) The equation for classical conservation of energy, where we'll start, is

$E = K + V$

where $K$ is kinetic energy, $V$ is potential energy, and $E$ is total energy. Doesn't this already look a little like Schrödinger's equation? We have one term on the left, and it's the sum of two terms on the right.

From here, we're going to modify this equation step by step until we end up with Schrödinger's equation. First of all, we could write $K$ in terms of mass $m$ and velocity $v$.

$E = \frac{1}{2} m v^{2} + V$

It turns out that in quantum mechanics, the momentum $p = m v$ will be more helpful to us than the velocity. Luckily, we can write $K$ in terms of mass and momentum as well.

K = \frac{m v^{2}}{2} = \frac{m^{2} v^{2}}{2 m} = \frac{(m v)^{2}}{2 m} = \frac{p^{2}}{2 m}

$E = \frac{p^{2}}{2 m} + V$

Now we are going to transition from classical mechanics to quantum mechanics. To do this, we are going to multiply by the quantum wavefunction $Ψ$ on both sides, just to get it into our equation.

$E Ψ = \frac{p^{2}}{2 m} Ψ + V Ψ$

But at this point, our equation is a bad mix of classical and quantum mechanics that doesn't really make sense. We said that we don't deal with exact values of momentum in quantum mechanics, only probability distributions with our wavefunction $Ψ$. But here we have $p$ and $Ψ$ in the same equation, as if we knew exactly what the momentum $p$ was.

We might not know what $p$ is, but it turns out we can change $\frac{p^{2}}{2 m} Ψ$ to something in terms of $Ψ$ and things we do know, so we can deal with probabilities like we're supposed to.

Operators

We don't really know what $Ψ$ is, since we are keeping it general, but we can write it in general terms. How about this: $Ψ = A e^{i k_{x} x} e^{i k_{y} y} e^{i k_{z} z} e^{- i ω t}$

Why is only time negative?

This is a tough question, and I couldn't find a satisfactory answer online, but here's one way I found to think about it. From multivariable calculus, we have

$\frac{d x}{d t} = - \frac{\partial Ψ / \partial t}{\partial Ψ / \partial x}$

If we had $e^{i ω t}$ instead of $e^{- i ω t}$, this would become

$\frac{d x}{d t} = - \frac{i ω}{i k_{x}} = - \frac{ω}{k_{x}}$

Later, we'll find that $ω = 2 π f$ and $k_{x} = 2 π p_{x} / h$. Setting $d x / d t$ equal to the classical velocity $p_{x} / m$, this gives

$\frac{d x}{d t} = \frac{p_{x}}{m} = - \frac{2 π f}{2 π p_{x} / h} = - \frac{h f}{p_{x}}$

$\frac{p_{x}^{2}}{m} = - h f$

But this means frequency is negative, or momentum is imaginary, and neither makes sense. Therefore, we must have opposite signs for space and time.

This equation doesn't really tell us much about $Ψ$ . We have no idea what the values of these variables are. That's actually a good thing, because we don't have information about $Ψ$ , so we don't want to pretend like we do and make stuff up.

All we are saying is that $Ψ$ is some number $A$ times something like $e^{i θ}$ a bunch of times. Each $e^{i θ}$ term shows that the wave function depends on something in some way: $x$ , $y$ , and $z$ for position in three dimensions, and $t$ for time. In other words, our equation translated to English is just saying “the wavefunction depends on space and time in some way.”

There is one more thing the equation is saying. $e^{i θ}$ is an oscillating function, so our wavefunction will oscillate like a wave.

How do we know it's a wave of this form? Well, actually we don't. But it turns out that if you have some solutions to the Schrödinger equation, their sum (more precisely, their linear combination) will also be a solution. Also, there is something called the Fourier transform which says that you can write any function as a sum of sine and cosine functions (waves). Putting these two ideas together, if we can derive the Schrödinger equation for a general wave, we can add waves together to make whatever other function we want. This sum will also be a solution since it's the sum of individual solutions.

Kinetic Energy

So we have a wavefunction, and it's a wave. We might be interested in the wavelength (in space) and frequency (in time) of the wave. If we have $e^{i (a) θ}$ , the “wavelength” would be $2 π / a$ , since we make a full circle back to $e^{0 i} = e^{2 π i}$ once $θ$ reaches $2 π / a$ . That means for $e^{i (k_{x}) x}$ , our wavelength $λ_{x}$ is $2 π / k_{x}$ . The same idea applies to $k_{y}$ and $k_{z}$ , for the wavelengths in the $y$ and $z$ directions.

But wait, remember the de Broglie equation?

$λ = \frac{h}{p}$

This means that if we have the wavelength for each direction, we can easily find the momentum in that direction.

$p = \frac{h}{λ}$

$p_{x} = \frac{h k_{x}}{2 π}, p_{y} = \frac{h k_{y}}{2 π}, p_{z} = \frac{h k_{z}}{2 π}$

Let's define a new constant, $ℏ = h / 2 π$, just to clean things up a little.

$p_{x} = ℏ k_{x}, p_{y} = ℏ k_{y}, p_{z} = ℏ k_{z}$

There's one problem: we have no idea what all these $k$ values are. I made them up when we wrote a general equation for $Ψ$. But something interesting happens if we take the second derivative of $Ψ$. We find

$p_{x}^{2} Ψ = - ℏ^{2} \frac{\partial^{2} Ψ}{\partial x^{2}}$

Proof

Ψ = A e^{i k_{x} x} e^{i k_{y} y} e^{i k_{z} z} e^{- i ω t}

\frac{\partial^{2} Ψ}{\partial x^{2}} = (i k_{x})^{2} A e^{i k_{x} x} e^{i k_{y} y} e^{i k_{z} z} e^{- i ω t} = (i k_{x})^{2} Ψ = - k_{x}^{2} Ψ

Ψ = - \frac{1}{k_{x}^{2}} \frac{\partial^{2} Ψ}{\partial x^{2}}

p_{x} = ℏ k_{x} ⟹ \frac{1}{k_{x}} = \frac{ℏ}{p_{x}}

Ψ = - \frac{1}{k_{x}^{2}} \frac{\partial^{2} Ψ}{\partial x^{2}} = - \frac{ℏ^{2}}{p_{x}^{2}} \frac{\partial^{2} Ψ}{\partial x^{2}}

p_{x}^{2} Ψ = - ℏ^{2} \frac{\partial^{2} Ψ}{\partial x^{2}}
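If you'd like to see this identity hold with actual numbers, here is a minimal sketch that checks it with a finite-difference second derivative. The values of `HBAR`, `K_X`, `OMEGA`, and `A` are arbitrary assumptions chosen for the demo (natural units, one spatial dimension), not anything physical.

```python
import cmath

HBAR = 1.0   # natural units (assumption for the demo)
K_X = 2.5    # arbitrary wavenumber
OMEGA = 1.3  # arbitrary angular frequency
A = 0.7      # arbitrary amplitude


def psi(x, t):
    """One-dimensional plane wave A e^{i k_x x} e^{-i omega t}."""
    return A * cmath.exp(1j * K_X * x) * cmath.exp(-1j * OMEGA * t)


def second_derivative_x(f, x, t, h=1e-3):
    """Central finite-difference estimate of the second x-derivative."""
    return (f(x + h, t) - 2 * f(x, t) + f(x - h, t)) / h**2


x0, t0 = 0.4, 1.1
p_x = HBAR * K_X  # de Broglie relation p_x = hbar * k_x

lhs = p_x**2 * psi(x0, t0)
rhs = -HBAR**2 * second_derivative_x(psi, x0, t0)

# The two sides should agree up to the finite-difference error.
print(abs(lhs - rhs))
```

The difference printed at the end is tiny but not exactly zero, because the derivative is only approximated numerically.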

For the $y$ and $z$ components of momentum, we'll have almost the same equation, just replace $x$ with the new letter. For the total momentum, we have to add the squares of each component:

$p^{2} = p_{x}^{2} + p_{y}^{2} + p_{z}^{2}$

Now we can divide by $2 m$ and multiply by $Ψ$ on both sides, then plug in the equation for momentum in each component with the second derivatives.

\frac{p^{2}}{2 m} Ψ = - \frac{ℏ^{2}}{2 m} (\frac{\partial^{2} Ψ}{\partial x^{2}} + \frac{\partial^{2} Ψ}{\partial y^{2}} + \frac{\partial^{2} Ψ}{\partial z^{2}})

If you know multivariable calculus, you might recognize the Laplacian operator, $\nabla^{2}$ , in here. If not, just consider $\nabla^{2}$ to be a special abbreviation for the sum of all these second derivatives.

$\frac{p^{2}}{2 m} Ψ = - \frac{ℏ^{2}}{2 m} \nabla^{2} Ψ$

Remember earlier when we were working with classical mechanics? We said that $p^{2} / 2 m$ was kinetic energy. In our new quantum formula, we are using $- \frac{ℏ^{2}}{2 m} \nabla^{2}$ on our wavefunction to get the term corresponding to kinetic energy. Therefore, we say that $- \frac{ℏ^{2}}{2 m} \nabla^{2}$ is the operator for kinetic energy. Now let's look back at our old equation, where we just took a classical equation and multiplied by $Ψ$.

$E Ψ = \frac{p^{2}}{2 m} Ψ + V Ψ$

Now with our operator, we have the Schrödinger equation.

$E Ψ = - \frac{ℏ^{2}}{2 m} \nabla^{2} Ψ + V Ψ$

This is already a valid form of the Schrödinger equation if we know our value for the energy $E$. But we can expand this with more math.

The momentum operator

There are many forms of the Schrödinger equation. When we transitioned from classical to quantum mechanics, we had a bad mix of classical momentum $p$ and quantum wavefunction $Ψ$:

$E Ψ = \frac{p^{2}}{2 m} Ψ + V Ψ$

Just like we are using the kinetic energy operator $- \frac{ℏ^{2}}{2 m} \nabla^{2}$ to get the kinetic energy term, we could also use the momentum operator $\hat{p}$. This would be

$E Ψ = \frac{{\hat{p}}^{2}}{2 m} Ψ + V Ψ$

and with that one simple change, this is another valid form of the Schrödinger equation. It turns out that the momentum operator is equal to $- i ℏ \nabla$. You can plug that in and see that it works out to the same thing as with our kinetic energy operator.
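For anyone who wants to see the plugging-in written out, here is the step, reading ${\hat{p}}^{2}$ as the operator applied twice (with a dot product between the two $\nabla$ factors):

{\hat{p}}^{2} Ψ = (- i ℏ \nabla) \cdot (- i ℏ \nabla Ψ) = (- i)^{2} ℏ^{2} \nabla \cdot \nabla Ψ = - ℏ^{2} \nabla^{2} Ψ

\frac{{\hat{p}}^{2}}{2 m} Ψ = - \frac{ℏ^{2}}{2 m} \nabla^{2} Ψ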

Total Energy

Remember I said it might be interesting to know the wavelength and frequency of the wavefunction? We tried finding the wavelength, and ended up coming much closer to the Schrödinger equation. But let's not forget the frequency! Now let's find the frequency and hope we come even closer. Let's go back to the general wavefunction.

$Ψ = A e^{i k_{x} x} e^{i k_{y} y} e^{i k_{z} z} e^{- i ω t}$

The frequency in time will be based on the $e^{- i ω t}$ term. Specifically, the frequency will be $f = ω / 2 π$, for similar reasons as the wavelength $2 π / k$. Let's see what happens if we take the first derivative with respect to time.

\frac{\partial Ψ}{\partial t} = (- i ω) A e^{i k_{x} x} e^{i k_{y} y} e^{i k_{z} z} e^{- i ω t} = (- i ω) Ψ

f = ω / 2 π ⟹ \frac{\partial Ψ}{\partial t} = (- i \cdot 2 π f) Ψ

But wait, remember $E = h f$? That means if we have the frequency, we can easily find the energy. If we plug this into our earlier Schrödinger equation, this gives us

$i ℏ \frac{\partial Ψ}{\partial t} = - \frac{ℏ^{2}}{2 m} \nabla^{2} Ψ + V Ψ$

There, that's the full Schrödinger equation!

Show steps

We just had

\frac{\partial Ψ}{\partial t} = (- i \cdot 2 π f) Ψ

f = \frac{E}{h} ⟹ \frac{\partial Ψ}{\partial t} = (- i \cdot 2 π E / h) Ψ

Solve for $E Ψ$ .

E Ψ = - \frac{h}{2 π i} \frac{\partial Ψ}{\partial t} = i \frac{h}{2 π} \frac{\partial Ψ}{\partial t}

Remember we defined $ℏ = h / 2 π$ .

E Ψ = i ℏ \frac{\partial Ψ}{\partial t}

Let's plug this back into our equation.

E Ψ = - \frac{ℏ^{2}}{2 m} \nabla^{2} Ψ + V Ψ

i ℏ \frac{\partial Ψ}{\partial t} = - \frac{ℏ^{2}}{2 m} \nabla^{2} Ψ + V Ψ

When I presented the Schrödinger equation at the beginning of this post, I did one more step to make it a little extra scary. We can consider position, which we can write as a vector $r$ . The wavefunction and potential energy can then be functions of both position $r$ and time $t$ . That gives

i ℏ \frac{\partial}{\partial t} Ψ (r, t) = - \frac{ℏ^{2}}{2 m} \nabla^{2} Ψ (r, t) + V (r, t) Ψ (r, t)
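As a sanity check, here is a minimal sketch that tests the equation numerically on a one-dimensional plane wave in a constant potential. Everything specific in it is an assumption for the demo: natural units (`HBAR = 1`), the values of `M`, `V0`, and `K`, and the finite-difference helpers `d_dt` and `d2_dx2`. The frequency is chosen so that $E = ℏ ω = p^{2} / 2 m + V$, which is the energy relation we started from.

```python
import cmath

HBAR = 1.0  # natural units (assumption for the demo)
M = 1.0     # particle mass
V0 = 0.5    # constant potential energy
K = 2.0     # wavenumber, so the momentum is p = HBAR * K

# Total energy E = p^2 / 2m + V, and E = h f = HBAR * omega.
E = (HBAR * K) ** 2 / (2 * M) + V0
OMEGA = E / HBAR


def psi(x, t):
    """Plane wave e^{i k x} e^{-i omega t} (amplitude 1)."""
    return cmath.exp(1j * K * x) * cmath.exp(-1j * OMEGA * t)


def d_dt(f, x, t, h=1e-4):
    """Central finite-difference estimate of the time derivative."""
    return (f(x, t + h) - f(x, t - h)) / (2 * h)


def d2_dx2(f, x, t, h=1e-3):
    """Central finite-difference estimate of the second x-derivative."""
    return (f(x + h, t) - 2 * f(x, t) + f(x - h, t)) / h**2


def residual(x, t):
    """Size of (left side - right side) of the Schrodinger equation."""
    left = 1j * HBAR * d_dt(psi, x, t)
    right = -HBAR**2 / (2 * M) * d2_dx2(psi, x, t) + V0 * psi(x, t)
    return abs(left - right)


print(residual(0.3, 0.8))
```

The residual comes out close to zero, limited only by the finite-difference approximation. Changing `OMEGA` to anything other than `E / HBAR` makes it clearly nonzero.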

What is Schrödinger's Equation?

So far, we've shown how to find Schrödinger's equation with classical mechanics, but I haven't really explained what the equation is like I promised in the title. Here are some thoughts about what it all means.

As a summary of what Schrödinger's equation is, you can think of it as a statement of conservation of energy in quantum mechanics. One big difference is that it is a probabilistic equation, since it tells you about the wavefunction. The wavefunction can help you predict what a particle is doing, but you can never be completely sure. In classical mechanics, you can be completely sure, at least if your model is right. (Another difference is that it's actually dealing with the Hamiltonian, which isn't always conserved.)

It's interesting to see that we played a lot with the total energy $E$ and the kinetic energy $K$ in the transition to quantum mechanics, but the potential energy $V$ is still just written as $V$ . This makes some sense, since $V$ really depends on the situation, while $E$ and $K$ are properties of the particle itself.

You must be wondering, why is the Schrödinger equation so much more complicated than $E = K + V$? Well, it doesn't have to be. It just is that way because it's more explicit about the quantities we need. If you want the simple version, you can write

$E Ψ = \hat{H} Ψ$

where $E$ is energy, and $\hat{H}$ is called the Hamiltonian operator and equals $- \frac{ℏ^{2}}{2 m} \nabla^{2} + V$. These forms are equivalent, since we earlier proved that $E Ψ = i ℏ \partial Ψ / \partial t$. This form is called the time-independent Schrödinger equation because there's no mention of time $t$, while the one from before is the time-dependent Schrödinger equation.

It seems like conservation of energy has a simple version in both classical physics and quantum physics. We just did a lot of work with a complicated analog of conservation of energy in quantum physics. Is there an analog to this in classical physics?

In general, for a conservative system, the Hamiltonian represents the sum of kinetic and potential energy. So in classical physics, we have

$E = K + V = H$

$E = \frac{p^{2}}{2 m} + V$

From here, we can't go further without more information, so we could say that this is the classical analog of the complicated Schrödinger equation.

But if we know more about the situation, we can make this more complicated. Maybe we know the potential energy is from gravity, and maybe we know the initial energy was all gravitational potential energy. Then we can have

g h_{0} = \frac{1}{2} [(\frac{d x}{d t})^{2} + (\frac{d y}{d t})^{2} + (\frac{d z}{d t})^{2}] + g z

Proof

$E = \frac{p^{2}}{2 m} + V$

$m g h_{0} = \frac{1}{2 m} p^{2} + m g z$

$m g h_{0} = \frac{1}{2 m} (p_{x}^{2} + p_{y}^{2} + p_{z}^{2}) + m g z$

$m g h_{0} = \frac{1}{2 m} (m^{2}) (v_{x}^{2} + v_{y}^{2} + v_{z}^{2}) + m g z$

$m g h_{0} = \frac{m}{2} (v_{x}^{2} + v_{y}^{2} + v_{z}^{2}) + m g z$

g h_{0} = \frac{1}{2} [(\frac{d x}{d t})^{2} + (\frac{d y}{d t})^{2} + (\frac{d z}{d t})^{2}] + g z

From there, we can use something called Hamiltonian mechanics to see how the system will evolve, without needing Newton's laws.
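To make that a little more concrete, here is a rough sketch of what evolving the system with Hamilton's equations could look like for an object dropped from rest. It only tracks the vertical direction, and the names `MASS`, `H0`, `hamiltonian`, and `step`, the leapfrog update, and the step size are all choices made for this illustration rather than anything from the derivation above.

```python
G_ACCEL = 9.81  # m/s^2, gravitational acceleration near the earth's surface
MASS = 2.0      # kg, arbitrary; it cancels out of the final equation
H0 = 10.0       # m, starting height (released from rest)


def hamiltonian(z, p):
    """H = p^2 / 2m + m g z for vertical motion only."""
    return p**2 / (2 * MASS) + MASS * G_ACCEL * z


def step(z, p, dt):
    """One leapfrog step of Hamilton's equations:
    dz/dt = dH/dp = p / m,   dp/dt = -dH/dz = -m g."""
    p_half = p - 0.5 * dt * MASS * G_ACCEL
    z_new = z + dt * p_half / MASS
    p_new = p_half - 0.5 * dt * MASS * G_ACCEL
    return z_new, p_new


z, p = H0, 0.0
for _ in range(1000):  # 1000 steps of 1 ms = 1 second of falling
    z, p = step(z, p, 0.001)

v = p / MASS
# Both sides of  g h0 = (1/2) v^2 + g z  should match.
print(G_ACCEL * H0, 0.5 * v**2 + G_ACCEL * z)
```

Nothing here uses $F = m a$ directly: the motion comes entirely from the two derivatives of the Hamiltonian.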

This is somewhat complicated, like the more complicated version of the Schrödinger equation. It also has derivatives, which can help us find how the system will change with time. The moral of the story is that given the basic idea of the Hamiltonian $E = H$ , and some information specific to the situation, we can plug in that information to get a more complicated but more useful equation. It's interesting how this works for both quantum and classical physics.

The Schrödinger equation looks scary, and it is a little scary, but it's also meaningful. Just like conservation of energy in the Hamiltonian form can help us tell what will happen in classical mechanics, the Schrödinger equation tells us what will happen to a wavefunction in quantum mechanics.

There's a lot more to this, like how exactly we use wavefunctions, and what a wavefunction is. There are many questions that still don't have answers, like what it means that all this seems probability-based. But considering that this equation is at the heart of our universe (ignoring relativity, or until someone finds a better one that explains quantum gravity or something) it's interesting to know that it's related to classical conservation of energy, which even introductory physics students know about.


References

Quantum Physics I (B. Zwiebach, MIT)
Hamilton’s Equations of Motion (Jeremy Tatum, University of Victoria)

Testing Copernicus

2021-10-15T16:38:07Z

In school, we are taught that the earth goes around the sun. This is correct. I'm not trying to say heliocentric theory is wrong. (A physics major disagreeing with heliocentrism is not a good look at all.)

I'm not trying to say heliocentric theory is wrong, but (this sounds bad so far) I am going to look at one way in physics that geocentric theory could be correct too. (Could be, but it's really not. I'm not saying it is.)

Now that we have firmly established that I believe the earth and all the planets go around the sun, not the other way around, I think I can safely start my critique of heliocentrism.

There are aspects of this that have to do with the rotation of the earth, the Coriolis effect, and the theory of relativity. I will neglect these ideas, as they are topics in themselves, and aren't really necessary to look at heliocentrism compared to geocentrism.

Frame of Reference

From the sun's perspective it is clear that the earth goes around the sun. If people could live on the sun, they would look up at the fiery sky and observe the planets rising and setting, just like we observe the sun rising and setting here on Earth.

But why is the earth's perspective any less valid? After all, physics dictates that there is no privileged frame of reference and the laws of physics are the same in all inertial reference frames.

It turns out that while the perspective on the sun is an inertial reference frame, the perspective on Earth is not. This is the reason why heliocentrism is accepted while geocentrism is not.

For our purposes today, an inertial reference frame is one that follows Newton's first law of motion, or one in which real forces are the only way in which things can accelerate. If you don't see how a reference frame might break Newton's first law, consider this thought experiment.

You and I are in space. First, my perspective: I believe that I am stationary, and I see that you are stationary as well. Then, you use a jetpack to produce a force on you that accelerates you at 1 m/s². Does this follow Newton's first law of motion?

As for me, I am at rest, and I remain at rest because I am not acted upon by an outside force. As for you, you start at rest, but you are then acted upon by an outside force to accelerate you. This all follows Newton's first law of motion, meaning my frame of reference is inertial.

But now let's think about your perspective. You start at rest, and you see me at rest. But then, in your perspective, I start to accelerate towards you at 1 m/s² when you start your jetpack. It's like when you're in a car, it seems like you are sitting still while everything else moves past. Does this follow Newton's first law of motion?

As for me, I accelerate from rest, but there is no force on me. This doesn't make sense according to Newton's first law. As for you, you remain at rest, even though the jetpack exerts a force on you. This doesn't make sense either. How do we deal with this?

Since this is a non-inertial reference frame, we can solve the problem by creating a fictitious force. In this case, we will need a fictitious force to accelerate both you and me by 1 m/s² in the opposite direction of the force on you due to the jetpack. Now everything makes sense: this fictitious force cancels out with the force of the jetpack on you, so you remain at rest. This fictitious force is the only force acting on me, so it causes me to accelerate at 1 m/s².

Even though we now have a reference frame that follows the laws of physics, we needed to invent a fictitious force to get there. This fictitious force is fictitious, as the name implies. If we need a fictitious force, the frame is non-inertial.
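The bookkeeping in this thought experiment can be sketched in a few lines. This is only an illustration in one dimension: the dictionary `real_accel` and the function `apparent_accel` are names made up here, and positive numbers mean "in the direction the jetpack pushes you".

```python
JETPACK_ACCEL = 1.0  # m/s^2, your acceleration as measured in my inertial frame

# Accelerations caused by real forces only (the inertial-frame values).
real_accel = {"me": 0.0, "you": JETPACK_ACCEL}


def apparent_accel(body, frame_accel):
    """Acceleration of `body` seen from a frame that itself accelerates at
    `frame_accel`: the real acceleration plus a fictitious term -frame_accel."""
    fictitious = -frame_accel
    return real_accel[body] + fictitious


for body in ("me", "you"):
    in_my_frame = apparent_accel(body, 0.0)              # inertial frame
    in_your_frame = apparent_accel(body, JETPACK_ACCEL)  # frame riding the jetpack
    print(body, in_my_frame, in_your_frame)
```

In my frame the fictitious term is zero, so only real forces matter. In your frame the same `-1.0` gets added to everyone, which is exactly the made-up force described above.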

Now, why do we need a fictitious force to produce a geocentric theory?

Comparing Heliocentric and Geocentric Models

Heliocentrism

I'll start with a heliocentric model. Let's make things very simple so we can focus on the differences between heliocentrism and geocentrism rather than on the details of orbital mechanics. We'll only consider the earth and the sun, and we'll assume the earth's orbit is circular.

Then the earth is at a constant distance $R$ from the sun. The acceleration of the earth is given by $G M / R^{2}$ , where $M$ is the mass of the sun and $G$ is the gravitational constant. Since the mass of the sun is much greater than the mass of the earth, we can take the acceleration of the sun to be 0. Now we will ask the same question: is this an inertial frame of reference? Does this follow Newton's first law?

Looking at the earth, it feels a force $G M m / R^{2}$ from the sun, where $m$ is the mass of the earth. Based on this, the acceleration of the earth is $G M / R^{2}$ . We will say that this is just the acceleration we need in order to keep a stable orbit at this speed. (We can do this with basic algebra-based physics.)

Proof: keeping a stable orbit

Let's say the earth orbits the sun at a speed $v$, and continues this speed in its circular orbit. In order to do so, the earth needs the right amount of centripetal force. This centripetal force must be provided by gravity, the only force on the earth. Therefore, we can set the force due to gravity and the required centripetal force equal to each other.

\frac{G M m}{R^{2}} = \frac{m v^{2}}{R}

v = \sqrt{\frac{G M}{R}}

So as long as the earth keeps going at $v = \sqrt{G M / R}$, it can keep its circular orbit stable.

Since $G M / R^{2}$ is both the acceleration calculated for earth based on the forces it experiences, and the acceleration it must be observed to experience in order to keep a circular orbit, Newton's first law is followed: the earth is an object in motion that is being acted upon by an outside force and responding predictably to that force. This is analogous to the space example: the acceleration we would calculate for you because of your jetpack, is also the same acceleration I observed in my frame of reference. Therefore, this is an inertial frame of reference.
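To put rough numbers on this, here is a small sketch that plugs in approximate textbook values for $G$, the sun's mass, and the earth-sun distance. These constants are assumptions brought in for illustration (the model above keeps everything symbolic), and the circular orbit is the same simplification as before.

```python
import math

G = 6.674e-11     # gravitational constant, m^3 kg^-1 s^-2 (approximate)
M_SUN = 1.989e30  # mass of the sun, kg (approximate)
R = 1.496e11      # mean earth-sun distance, m (approximate)

# Orbital speed from setting gravity equal to the centripetal force.
v = math.sqrt(G * M_SUN / R)

# Acceleration predicted from the force of gravity on the earth...
accel_from_gravity = G * M_SUN / R**2
# ...and the acceleration needed to stay on a circle of radius R at speed v.
accel_centripetal = v**2 / R

print(f"orbital speed: {v / 1000:.1f} km/s")
print(f"from gravity:  {accel_from_gravity:.5f} m/s^2")
print(f"centripetal:   {accel_centripetal:.5f} m/s^2")
```

The speed comes out near 30 km/s, and the two accelerations agree, which is the match between predicted and observed acceleration that makes the sun-centered frame (approximately) inertial.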

The acceleration of the sun is calculated as $G m / R^{2}$ . We can assume this is very small, near 0, since $G$ is very small while $R$ is very big. Since our frame of reference is centered on the sun, this should be 0, so this makes sense. (In order for it to add up perfectly, we would need a frame of reference centered on the barycenter of the sun and earth, but this is close enough to the sun that we can say a heliocentric model works.)

Geocentrism

Now let's shift to a geocentric model. Since we are centering our frame of reference on the earth, the acceleration of the earth should be 0. The sun is going around the earth at the same speed $v = \sqrt{G M / R}$ that the earth went around the sun in the heliocentric model, and at the same distance $R$ . This means that the centripetal acceleration needed for the sun is equal to $G M / R^{2}$ , which is what was needed for the earth before.

Is this an inertial frame of reference? Let's look at the sun. Imagine you are an observer on earth with access to the position of the sun, its speed, its mass, its distance, the mass of earth, all these relevant variables. What would you see if you watch the sun?

You would see it rise and set as it orbits around the earth. Based on its observed speed $v = \sqrt{G M / R}$ and distance $R$ , you can calculate its acceleration to be $G M / R^{2}$ . Does this acceleration make sense? The only force on the sun is gravity from the earth, with magnitude $G M m / R^{2}$ , meaning you would predict the acceleration of the sun to be $G m / R^{2}$ . This is much less than the $G M / R^{2}$ you observe.

As for the earth, it experiences the same force $G M m / R^{2}$ , predicting an acceleration $G M / R^{2}$ . But since you are on earth, it seems to you that the acceleration is 0. This is off too. Therefore, your frame of reference is not inertial.

We can resolve this by creating a fictitious force. Remember that we just found the predicted acceleration of the sun to be $G m / R^{2}$ . Let's approximate this to be 0. (We are doing this because we will neglect any difference between the barycenter of the solar system and the center of the sun.) The acceleration for the sun that we observe in a geocentric model is $G M / R^{2}$ , so we will apply this acceleration to the sun in the direction towards the earth.

The earth feels a pull to the sun, but we find that our new fictitious force opposes this. An acceleration vector of $G M / R^{2}$ from the sun towards the earth, if applied to the earth, becomes an acceleration vector of $G M / R^{2}$ in the direction opposite the sun. It's easy to see how this will cancel the force of gravity on the earth due to the sun.

So by creating a fictitious force that produces an acceleration $G M / R^{2}$ in the direction from the sun to the earth, we have now built a working geocentric model.
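Here is a minimal numerical sketch of that construction, working along the line between the two bodies with the positive direction pointing from the earth toward the sun. The constants are approximate textbook values and the variable names are invented for this example. Unlike the text, it keeps the sun's small real acceleration $G m / R^{2}$ instead of rounding it to 0, so you can see how little it matters.

```python
G = 6.674e-11       # gravitational constant, m^3 kg^-1 s^-2 (approximate)
M_SUN = 1.989e30    # kg (approximate)
M_EARTH = 5.972e24  # kg (approximate)
R = 1.496e11        # m (approximate)

# Accelerations from real gravity; +x points from the earth toward the sun.
earth_real = +G * M_SUN / R**2   # earth is pulled toward the sun
sun_real = -G * M_EARTH / R**2   # sun is pulled (very slightly) toward the earth

# Geocentric frame: add the same fictitious acceleration to every body,
# chosen so that the earth ends up with zero acceleration.
fictitious = -earth_real         # points from the sun toward the earth

earth_seen = earth_real + fictitious
sun_seen = sun_real + fictitious

print(earth_seen)                # 0.0: the earth stays put
print(sun_seen, -G * M_SUN / R**2)
```

The sun's apparent acceleration matches $G M / R^{2}$ toward the earth to within a few parts per million, the size of the approximation we made.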

Of course, this fictitious force isn't real. That's why we say geocentrism isn't real. Since we need a fictitious force to account for geocentrism, it makes more sense to use a heliocentric model.

Barycentrism?

I said that the acceleration of the sun due to the earth's gravity is $G m / R^{2}$ , and I approximated this to be 0. What if I hadn't done this approximation?

Then we would find that a strictly heliocentric model actually needs a slight correction with a fictitious force, similar to what we found with a geocentric model. The acceleration of the sun should be 0 in a heliocentric model, but this doesn't match up with the fact that there should be a (slight) acceleration of the sun due to gravity. We need a fictitious force to fix this.

So if geocentrism needs a fictitious force, and heliocentrism needs a fictitious force, which model is really correct? In order to really have no fictitious force, we need to set a frame of reference centered at the center of mass of the solar system (called the barycenter) rather than the center of the sun. Since the sun's mass is such a huge piece of the solar system's mass, it turns out that the actual barycenter of our solar system is inside the sun. But the mass of the other stuff is there too, so the barycenter isn't exactly at the middle of the sun.

In some systems, such as binary star systems, the barycenter might not be inside any of the bodies. For example, if two stars of equal mass orbit each other, their barycenter is right between them, not inside either one. In this case, we would set up a coordinate system at this barycenter in order to avoid fictitious forces. This makes a lot of sense for a system of binary stars, and might help see why it applies even to our solar system, since it can be unintuitive at first to center a model around a point slightly off balance from the center of the sun. We might think at first that we have to pick a particular body to center our system around, like the earth or the sun or Jupiter, but the binary star example shows that the barycenter is really what we need.
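For a sense of scale, here is a short sketch that locates the barycenter of a two-body system. It only looks at the sun and the earth (ignoring every other planet), uses approximate textbook constants, and the helper `barycenter_offset` is a name made up for this example.

```python
def barycenter_offset(m1, m2, separation):
    """Distance from the center of body 1 to the two-body center of mass."""
    return separation * m2 / (m1 + m2)


M_SUN = 1.989e30     # kg (approximate)
M_EARTH = 5.972e24   # kg (approximate)
R = 1.496e11         # m, earth-sun distance (approximate)
SUN_RADIUS = 6.957e8 # m (approximate)

sun_earth = barycenter_offset(M_SUN, M_EARTH, R)
print(f"sun-earth barycenter: {sun_earth / 1000:.0f} km from the sun's center")
print(f"inside the sun? {sun_earth < SUN_RADIUS}")

# Two equal-mass stars 10 units apart: the barycenter is halfway between them.
print(barycenter_offset(1.0, 1.0, 10.0))
```

With only the earth included, the barycenter sits a few hundred kilometers from the sun's center, tiny compared to the sun's radius. The equal-mass case lands exactly in the middle, outside both stars.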

Conclusion

Based on the idea that all frames of reference are valid in physics, I tried making a geocentric model of the solar system. But then we found that I needed a fictitious force for the model to work, so this isn't a very good model. We found the heliocentric model to be better.

But taking a closer look at the heliocentric model, we realized it depends on approximation. If we remove this approximation, we end up with a barycentric model, centered at the center of mass of the solar system. Since the sun has such a big piece of the mass of our solar system, we usually approximate this to say that heliocentrism is the true model. But in reality, barycentrism works a bit better, although it is only slightly different from heliocentrism.

Using the idea of barycentrism, you can tell your friends that you don't believe in heliocentrism. Then, when they call you a conspiracy theorist and ask if you're a flat-earther too, you can explain why a barycentric model is technically superior to both heliocentrism and geocentrism for the same reason that heliocentrism is superior to geocentrism: it cuts out fictitious forces. I don't know if this is actually the kind of stuff you tell your friends, but if it is, I hope you enjoy.

Romans Needed Quantum Computers

2021-10-01T16:38:07Z

Throughout history, we have always tried to condense information into smaller and smaller spaces. Various cultures have created their own techniques and adopted techniques from other cultures. In this post, I'll go through the history of information density with numbers, and compare this to computers and quantum computers.

Ancient Beginnings

Prehistory

Long long ago, before writing was developed, people must have counted on their fingers. But they also sometimes made markings to count. Imagine you are in prehistory counting sheep with a stone, making one mark per sheep. If there are $x$ sheep, it will take $n$ markings on the stone to count them all.

$n = x$

Unfortunately, the stone only has space for a few markings. But since we are talking about concepts here, if the stone were big enough, you could keep counting more sheep forever. So your system is capable of counting to infinity in theory.

$L_{x} = \infty$

There is only one type of mark you make, a simple line in the stone. So the number of types of symbols is

$S = 1$

Ancient Rome

Much later, the Romans started using Roman numerals. They started making marks just like in prehistory: I is 1, II is 2, and III is 3. But then they thought, why make five marks like IIIII for five when you can just write V?

And here we are in year MMXXI. Unfortunately, the highest symbol is M, for 1000. Additionally, only up to three of the same letter are allowed in a row. This makes the largest number MMMCMXCIX, or 3999. In the Middle Ages, a line on top called a vinculum was used, multiplying a number by 1000 [1]. This means MMMCMXCIX under a vinculum is 3999000, and then we can add CMXCIX (999) to get MMMCMXCIXCMXCIX, with the vinculum over the first nine letters (3999999). Then we're really stuck.

So we have $L_{x} = 3999999$ . Looking at MMMCMXCIXCMXCIX, we count 16 symbols needed to write this number (including the vinculum). Now how can we determine $n$ , the number of symbols needed to write a number, as a function of $x$ , the number we are writing?

For simplicity, I'll only consider numbers up to 3999, so let's bring $L_{x}$ down to 3999. Let's map each place value to a Roman numeral. The ones place can have I, II, III, IV, V, VI, VII, VIII, IX. This is an average of 20/9 symbols, or about 2.2 symbols. The tens place can have X, XX, XXX, XL, L, LX, LXX, LXXX, XC. This is very similar, and has the same average number of symbols. The hundreds place is the same. So we can say each new place value takes another 20/9 symbols to represent. This gives something like

$n = \frac{20}{9} \log_{10} x$

Does this work? We can make a Matplotlib graph comparing this model to the actual length of the Roman numerals for a number $x$. Here is what we get:

Our model (in orange) seems to approximate the data (blue) reasonably well over the long term. There is also a moving average of the data over 100 numbers (green) which stays close to the orange line. You can see the low spikes due to numbers like M (1000) and MMD (2500). In case you're curious, the highest point on the graph is $x = 3888$ , written with 15 symbols as MMMDCCCLXXXVIII.
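If you want to reproduce the numbers behind a graph like this, here is a minimal sketch using only the standard library. The converter `to_roman`, the table `NUMERALS`, and the function `model` are written for this example and are not necessarily how the original plot was made. Plotting `lengths` against `model` with Matplotlib is left out to keep it short.

```python
import math

# Value/symbol pairs in descending order, including the subtractive forms.
NUMERALS = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]


def to_roman(x):
    """Standard Roman numeral (no vinculum) for an integer from 1 to 3999."""
    if not 1 <= x <= 3999:
        raise ValueError("x must be between 1 and 3999")
    parts = []
    for value, symbol in NUMERALS:
        count, x = divmod(x, value)
        parts.append(symbol * count)
    return "".join(parts)


def model(x):
    """The rough model n = (20/9) log10(x)."""
    return 20 / 9 * math.log10(x)


# Actual numeral length for every number we can write.
lengths = {x: len(to_roman(x)) for x in range(1, 4000)}
longest = max(lengths, key=lengths.get)

print(to_roman(2021))                                 # MMXXI
print(longest, to_roman(longest), lengths[longest])   # 3888 MMMDCCCLXXXVIII 15
print(sum(lengths[d] for d in range(1, 10)))          # 20 symbols across I..IX
print(round(model(3888), 2))                          # model's estimate at the peak
```

The last line shows how far the worst case sits above the model: the formula tracks the average, not the spikes.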

Ignoring the vinculum, we see that the symbols increase in a 1/5 pattern for each place value. So if we wanted to expand the Roman numeral system, we would need 2 symbols for each place value: continuing on the pattern of I/V for ones, X/L for tens, and C/D for hundreds. In the long term, this can be modeled as $S = 2 \log_{10} x$. So while the number of symbols $n$ is less than with prehistoric writing, we now face an ever-increasing number of types of symbols $S$.

Ancient India

Ancient India used a system quite different from that of the Romans. (For unicode compatibility reasons, I'll use modern numerals here. The symbols at the time were different, although still vaguely recognizable. They were 𑁦𑁧𑁨𑁩𑁪𑁫𑁬𑁭𑁮𑁯 if you can see that.)

The Indians also used a simple line for one, like the Romans (and the Chinese and others). But for a two, instead of two lines, there was a new symbol: 2. This continued all the way to 9. For ten, the symbol for one was used again, and then there was a new symbol for 0. This concept of place value could be used to write any number ($1729 = 1000 + 700 + 20 + 9$). Each additional decimal place value required one new symbol rather than an average of 2.2, giving $n = \log_{10} x$. This is a lower rate of increase than Roman numerals. Even more importantly, $L_{x} = \infty$: arbitrarily large numbers could now be written with enough symbols. We can make a graph for this system, now known as Arabic numerals (since Arabs brought the system from India to Europe), similar to that for Roman numerals.

We have confirmed that our model $n = \log_{10} x$ is roughly accurate, but unlike with Roman numerals, this model can now be extended as $x \to \infty$ even with just 10 symbols: $S = 10$ . This combined the best of the prehistoric and Roman systems: $n$ increases only as the log of $x$ rather than linearly, but the number of symbols $S$ is kept fixed no matter how large $x$ gets.

Other Ancient Civilizations

The Roman system and the modern descendant of the Indian system are probably the most familiar numeral systems to most readers. However, there were many other ancient numeral systems, at least two of which are still in common use today. I have analyzed these if you're interested.

Ancient Chinese

A descendant of the ancient Chinese system is used in China and Japan today when writing with characters rather than Arabic numerals. For unicode compatibility reasons, I will be using modern Chinese characters here. This is the system used in Chinese characters and Japanese Kanji; there is also an East Asian numeral system using counting rods, which works similarly to the modern system and even includes negative numbers.

Chinese has unique symbols for numbers 1–10. 20 is represented not by 2 and then 0, but by 2 and then 10: 二十. This is multiplication by placing a number 1–10 in front of a larger place. Addition occurs by placing it after: for example, while 二十 is $2 \times 10 = 20$ , 十二 is $10 + 2 = 12$ . Both of these can be combined: 32 is 三十二, since $3 \times 10 + 2 = 32$ .

This works up to 99. At 100, a new symbol 百 is introduced for 100. Then 二百二十二 is $2 \times 100 + 2 \times 10 + 2 = 222$ . There is a similar symbol 千 for 1000.

After 1000, a new character is introduced for 10,000. Then a new character must be introduced for every power of $10^{4}$. So the limit $L_{x}$ is high, but it is not infinite. I am not sure exactly how many characters are commonly used these days.

Now let's analyze the number of characters $n$ it takes to write a number $x$. For numbers up to 10, only one character is needed. For numbers above 10, we write the tens place and then one more character for the ones place, if needed. If there is a 10 in the tens place, we simply write one character, 十. Otherwise, two characters are needed. So the tens place adds about $\frac{1}{9} (1) + \frac{8}{9} (2) = \frac{17}{9}$ characters, about 1.9.

Now let's look at the hundreds place. From here on, we must include multiples of one. For example, 115 is not 百十五, but 一百一十五. This means that from 100 on, each place value (other than one) always takes two characters. Since we are interested in long term trends, we can write $n = 2 \log_{10} x$. This grows quicker than the modern decimal system, but more slowly than the Roman system: $1 < 2 < \frac{20}{9}$. Past the ten thousands place, there is no character for hundred thousands. How will we handle the hundred thousands place?

We will have to multiply the ten thousands place by a new tens place. This will mean adding a number 10–99 before the character for 10,000. This will take another 2 characters. So our rule of $n = 2 \log_{10} x$ still applies. As for the number of symbols, in the long term, we need a new symbol for every factor of 10,000. So we can write $S = \log_{10000} x$.
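The rules for numbers below 10,000 can be sketched in a few lines of Python. The function name `to_chinese` is my own, and the sketch makes one assumption that goes beyond the text: it inserts 零 for a run of interior zeros (as in 一百零五 for 105), which is the usual modern convention and adds a character the $n = 2 \log_{10} x$ estimate does not count.

```python
DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]  # ones, tens, hundreds, thousands


def to_chinese(x):
    """Write x (1..9999) with Chinese numeral characters."""
    if not 1 <= x <= 9999:
        raise ValueError("this sketch only handles 1..9999")
    s = str(x)
    out = []
    zero_pending = False
    for i, ch in enumerate(s):
        d = int(ch)
        place = len(s) - 1 - i
        if d == 0:
            # Remember the gap; it is only written if a nonzero digit follows.
            zero_pending = True
            continue
        if zero_pending:
            out.append("零")
            zero_pending = False
        # 10..19 are written 十, 十一, ... without a leading 一.
        if not (d == 1 and place == 1 and len(s) == 2):
            out.append(DIGITS[d])
        out.append(UNITS[place])
    return "".join(out)


for x in (12, 20, 32, 115, 222):
    text = to_chinese(x)
    print(x, text, len(text))
```

Counting `len(to_chinese(x))` over a range of $x$ is one way to compare the real character counts with the estimate.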

Ancient Greek, Hebrew, Arabic

The Greek, Hebrew, and Arabic numeral systems are all functionally identical, and differ mainly in the symbols. Greek uses Αʹ Βʹ Γʹ, Hebrew uses א ב ג‎, and Arabic uses ا ب ج for 1 2 3. These systems are all used today for at least some purposes. Arabs used these before they adopted what we know as Arabic numerals from India.

These systems have unique symbols for 1–9, 10–90, and 100–900. They vary in their treatments of numbers past this, but are different from what is shown here, so let's say $L_{x} = 999$ for this system. It's notable that each new place value requires only one more letter. For example, 4 is Δʹ/ד/د, 44 is ΜΔʹ/מד/مد, 444 is ΥΜΔʹ/תמד/تمد. Therefore we can write the same rule $n = \log_{10} x$. Interestingly, this is the same as modern Arabic numerals. The problem is that this requires a large number of new unique symbols: 9 for each place value. So we have $S = 9 \log_{10} x$.

Ancient Egyptian

Unfortunately, Egyptian hieroglyphics are not as widely supported in unicode as the other scripts on here, so you may have problems with rendering.

The Ancient Egyptian numeral system starts like the prehistoric one: each line represents a one. So 𓏺 is one, 𓏻 is two, 𓐂 is nine. After that, there are new symbols for each place value: 𓎆 is ten. These can be combined: 𓎇𓏾 is 25, and 𓍣𓎊𓏿 is 256.

The number of symbols needed for each place value obviously depends on the number, but it ranges from 1–9; this averages to 5. So we can say $n = 5 \log_{10} x$. Each new place value needs a new symbol. So $S = \log_{10} x$. The highest symbol was 𓁨 for $10^{6}$. Let's set this as $L_{x}$.
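As a rough check of the counting rule, here is a small Python sketch. The name `egyptian_mark_count` is hypothetical, and it counts individual marks (a 6 in the ones place is six strokes) rather than the precomposed Unicode characters used above.

```python
def egyptian_mark_count(x):
    """Number of marks needed for x: each decimal digit d contributes d
    copies of the symbol for its place value."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    return sum(int(d) for d in str(x))


# Average number of marks for a place holding a digit 1..9.
average_per_place = sum(range(1, 10)) / 9

print(egyptian_mark_count(25))    # 2 tens + 5 ones
print(egyptian_mark_count(256))   # 2 hundreds + 5 tens + 6 ones
print(average_per_place)
```

Note that a place holding 0 contributes no marks at all, so the true long-run average per place would be somewhat lower than 5.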

Ancient Mayan

Unfortunately, Mayan numerals are not supported on my computer in unicode, and I am not good at writing Ancient Mayan symbols, so I will not be able to show the script here. However, it is conceptually identical to the modern Arabic numerals, even including a symbol for zero. The big difference is that it is base-20, giving $n = \log_{20} x$ and $S = 20$, meaning that the Mayan system is even more efficient than the modern one in terms of $n$. (This is considering number of places alone. The symbols in individual places often have multiple components: for example, three is three dots; fourteen is four dots and two lines. This is almost like a numeral system within a larger place value system, rather than arbitrary symbols.)
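To see the effect of the base on the number of places, here is a minimal Python sketch of a positional system in any base. The helper `to_base` is my own naming, and it returns each place as a plain integer rather than a Mayan glyph.

```python
def to_base(x, base):
    """Return the place values of x in the given base, most significant first."""
    if x < 0 or base < 2:
        raise ValueError("need x >= 0 and base >= 2")
    digits = []
    while True:
        x, remainder = divmod(x, base)
        digits.append(remainder)
        if x == 0:
            break
    return digits[::-1]


year = 2021
print(to_base(year, 10), len(to_base(year, 10)))  # decimal places
print(to_base(year, 20), len(to_base(year, 20)))  # base-20 places
```

Here 2021 is $5 \times 400 + 1 \times 20 + 1$, so it needs three base-20 places instead of four decimal digits.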

Comparison

Now, let's compare all the systems we've seen so far. Then we'll see where modern systems come in.

As a refresher, $x$ is the number being written. $n$ is the number of symbols required to write $x$ in this system. $L_{x}$ is the maximum number $x$ that can be written with the system as is. $S$ is the number of unique symbols that would be needed to write an arbitrarily large $x$ with a version of the system, perhaps adding more symbols as needed.

System	$n$	$L_{x}$	$S$
Prehistoric (tallies)	$x$	$\infty$	1
Modern (decimal)	$\log_{10} x$	$\infty$	10
Roman	$\frac{20}{9} \log_{10} x$	3999	$2 \log_{10} x$
Chinese	$2 \log_{10} x$	$> 10^{8}$	$\log_{10000} x$
Greek/Hebrew/Arabic	$\log_{10} x$	999	$9 \log_{10} x$
Egyptian	$5 \log_{10} x$	$10^{6}$	$\log_{10} x$
Mayan	$\log_{20} x$	$\infty$	20

Other than the prehistoric system, it seems all the $n$ values follow logarithmic growth. Is this the most efficient system possible? Well, exponential growth is one of the fastest types of growth out there, so its inverse, logarithmic growth, should be one of the slowest. I can think of one function that grows faster than exponential: factorial. What would an inverse factorial system look like?

24 might take 4 digits to represent rather than 2. 120 would take 5. 720 would take 6. So far, this system is not doing well. But there is a point, somewhere way out there, beyond which the inverse factorial system will be more efficient than our system. Now to actually invent an inverse factorial method for representing numerals. I haven't seen anything like it. (Edit: when I went to sleep after writing this I couldn't stop thinking about an inverse factorial system, and then I thought I came up with one, but it turns out I didn't. You can look if you're interested, or I can leave it as an “exercise to the reader”.)

My attempted inverse factorial system

Instead of a 1s place, 10s place, 100s place... $10^{n}$ place... I had a 2s place, 3s place, 4s place... $n + 2$ place. Each place could have either a 1 or 0. Then you multiply together all the places that have a 1. So $1_{!}$ is 2, $01_{!}$ is 3, $11_{!} = 2 \times 3 = 6$ , $101_{!} = 2 \times 4 = 8$ . I then made some minor restrictions and a computer program so every number had a unique representation.

The reason why I thought this system would follow an inverse factorial law is that in the system, $(n + 1)!$ is written as a series of $n$ ones: 6 is $11_{!}$ , 24 is $111_{!}$ , 120 is $1111_{!}$ .
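Reading a number in this system is easy to sketch in Python. The function name `decode` is hypothetical, and this only covers the reading direction: the "minor restrictions" that make each representation unique are not spelled out here, so the sketch does not attempt to reproduce them.

```python
from math import prod  # available in Python 3.8+


def decode(digits):
    """Value of a digit string where the first digit is the 2s place, the
    second is the 3s place, and so on; places holding a 1 are multiplied."""
    if not digits or set(digits) - {"0", "1"}:
        raise ValueError("expected a non-empty string of 0s and 1s")
    return prod(place for place, d in enumerate(digits, start=2) if d == "1")


for text in ("1", "01", "11", "101", "111", "1111"):
    print(text, decode(text))
```

A string of $n$ ones multiplies $2 \times 3 \times \dots \times (n + 1)$, which is where the $(n + 1)!$ pattern comes from.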

But this system has some drawbacks. Notably, a prime number $p$ requires $p - 1$ digits to write, following a rule similar not to inverse factorial, but to $n = x$ . I wrote a program to figure out the number of digits $n$ needed to write a number $x$ in this system. Here are some graphs:

On the high side, you can see something approximating $n = x$ due to prime numbers. My orange model is $n = 2 x / \ln x$. This is very interesting, because a surprisingly good approximation for the number of primes less than $x$ is $\frac{x}{\ln (x)}$. I have no idea why it seems to follow this model so closely.

You could also have something like $n = \log (\log (x))$ . Again, I don't know how you would invent a numeral system that follows this rule, but it might be theoretically possible, and if so it would be quite efficient.

Modern Systems

Scientific notation

Scientific notation is based on the regular decimal system, so it also has $L_{x} = \infty$ and $S = 10$ . However, $n$ can be anything you want, no matter how large or small $x$ is. The problem is that this sacrifices accuracy (which can actually be a good thing when experimental uncertainty is involved). I found it interesting how scientific notation fits into the variables we've been looking at for ancient numeral systems.

Binary and Hexadecimal

Binary and hexadecimal are the same as the decimal system, but with a different base. (This is similar to the Mayan system.) But since $n = \log_{2} x$ for a binary computer, the number of bits $n$ required actually increases faster than the number of digits needed to write the number on paper. By change of base, the ratio is $\frac{\ln 10}{\ln 2} \approx 3.3$. So if a number $x$ takes $n_{10}$ digits to write on paper, we can predict it will take $n_{2} \approx 3.3 n_{10}$ bits in a computer as $x$ grows large. For example, 2021 takes four digits on paper, and the binary representation 11111100101 takes 11 bits: 11/4 = 2.75. If we take a really big number, like 1234567890, this takes ten digits to write on paper, and 31 bits to represent as 1001001100101100000001011010010. The ratio 31/10 = 3.1 is closer to 3.3.
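The two examples can be reproduced with Python's built-in integer tools. This is just a quick check; the helper name `bits_per_digit` is my own.

```python
import math

# Limiting ratio of bits to decimal digits, by change of base.
LIMIT = math.log(10) / math.log(2)


def bits_per_digit(x):
    """Bits in the binary form of x divided by digits in its decimal form."""
    return x.bit_length() / len(str(x))


for x in (2021, 1234567890, 10**100):
    print(x.bit_length(), len(str(x)), bits_per_digit(x))

print(LIMIT)
```

The last value, a 1 followed by 100 zeros, is there to show the ratio getting closer to the limit as $x$ grows.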

In reality, numbers may take up more or less space in computer memory based on different systems to store them, such as two's complement and float64. However, we are now using numbers as a template to talk more generally about information density.

I can talk about an abstract information density, which is the amount of information $x$ divided by the space to store it $n$ . With the prehistoric system, this is $x / x = 1$ . With the decimal system, this is $x / \log_{10} x$ , which tends to be higher than 1: each digit in a decimal number, like the 2 in 123, has much more meaning than a simple tally mark.

But with computers, this is $x / \log_{2} x$ , which is apparently lower than for the decimal system. This can be explained by the fact that even though computers need a high number of bits, the bits are extremely small physically. They can fit on a microchip.

DNA

DNA has four types of nitrogenous bases, so $S = 4$ . Every 3 base pairs codes for one amino acid (or a stop codon). The number of base pairs needed to code for a protein can be considered $n$ . Such a protein will have $n / 3$ amino acids. The question is, what number $x$ corresponds to the amount of information stored in a protein with these amino acids?

I feel like the amount of information in a protein with 100 amino acids is a lot more than the amount of information conveyed by writing 100 on a piece of paper. Since there are many complicated ways for amino acids to interact as their number increases, moving into tertiary and even quaternary structure, I feel like amino acids have even more ways to add meaning $x$ as their number $n / 3$ increases. This means $x$ might be exponentially related to $n$. So we can invert this and say $\frac{n}{3} \propto \log_{4} x$, and therefore $n \propto \log_{4} x$. As for the limit of information, DNA can store an immense amount of information. I'll go with $L_{x} = \infty$.

Quantum computing

In quantum computing, quantum particles are in a superposition of states. When measured, they settle on a state that corresponds to either 0 or 1, with probabilities set by the superposition (equal in the simplest case). So if we have $n$ qubits (quantum bits), each corresponding to 0 or 1, that's like a binary number with $n$ bits. So then what's the advantage of quantum computing?

Qubits settle on 0 or 1, when measured. If we don't measure them, they remain in a superposition of all possible states at once. This means that with a properly designed algorithm, they can conduct multiple calculations at once.

Then the question is, how do we access this information if every measurement settles on 0 or 1? This is part of the reason why quantum computing is so difficult, but it is possible through quantum interference. This is the strange phenomenon where since a quantum particle is also a wave (wave-particle duality), this wave can interfere with itself. (This can be shown with the double slit experiment.) The various calculations can be combined and measured to get a useful result [2].

So in theory, a quantum algorithm can perform infinite calculations at once. This means $n = C$ : as long as you have the constant $C$ qubits for the calculation you're interested in, that same number of qubits can store an infinite amount of information $x$ .

In practice, it is difficult to combine the results in order to extract information, because each qubit's measured state is either 0 or 1. This means that for $n$ qubits, the information $x$ is not infinite, at least not based on any measurements we can take. As for the exact relationship between $n$ and $x$ , I'm not sure. I think it might depend on the algorithm in question, which makes sense since each algorithm has its own $n$ . This reminds me of hard computer science problems that can't be solved generally, like the halting problem, shown undecidable by Alan Turing.

There are a few options for a “type of symbol” in quantum computing. One example is a quantum computer based on ions. In this case, we could say $S = 1$ . However, we also need to consider the various connections between qubits. Then we could maybe say $S = 2$ , or maybe we could consider the various quantum logic gates [3], composed of qubits, as a new type of symbol in themselves.

What is the limit of information $L_{x}$ that can be stored? In theory, more qubits always means more information. In practice, qubits interacting can lead to quantum decoherence, losing the unique quantum properties needed for quantum computers to work. As of September 2020, IBM's largest quantum computer had 65 qubits [4].

So even though superposition means an infinite possible number of states, this doesn't translate into an infinitely larger amount of information compared to a classical computer. However, if quantum decoherence can be kept under control, the idea is promising.

But as with new technologies in general, I worry about how it might be applied negatively. Quantum computing has huge potential for positive advancements in cryptography. On the other hand, this means it has potential for a lot of hacking. If quantum computing is implemented correctly by a hacker, quantum superposition would make our current classical systems completely vulnerable. Hopefully (?) quantum decoherence remains a large issue, so hackers won't be able to solve it reliably, but organized teams will be able to use it for good in certain applications.

Conclusion

Starting with prehistoric counting, I introduced a set of variables with which we can view systems of information. All the numeral systems devised in ancient times fit neatly into this set. Binary and hexadecimal as used in classical computing worked as well.

DNA and scientific notation were difficult but interesting to analyze in the way I did for numeral systems. Quantum computing, on the other hand, became almost impossible. The theoretical aspects of quantum theory and superposition repeatedly imply “infinite power”, but the practical barriers are hard to quantify, especially because they're always being pushed. It's clear that developing quantum computing will be extremely difficult. It's also clear that at least for small numbers of qubits without too much decoherence, quantum computing is promising to handle huge amounts of data at high speeds.

Roman numerals, with their low limit on data $L_{x}$ and quickly increasing space needed $n$ , bear little resemblance to the ideals of what quantum computing could be.

Romans needed quantum computers.

Mathematical History — Why we need more historians doing math
2+3=4 — Why math seems to describe the physical world so well

References

[1] Which is the biggest number in Roman numerals? (Roman Numerals)
[2] What Can We Do with a Quantum Computer? (Andris Ambainis) (University of Latvia)
[3] Quantum Logic Gates (NIST)
[4] IBM promises 1000-qubit quantum computer—a milestone—by 2023 (Adrian Cho)
