On April Fools' Day this year, I made a fake YuGPT and claimed it was a language model trained on my own writing. In reality it was just a JavaScript script: no matter what you asked, it returned a fixed answer. The joke was fake, but the idea behind it wasn’t. I really had planned to build a small YuGPT as an Easter egg. I was just too busy at the time, mostly with fishing trips all over the place. Now that everything outside is frozen over, the winter solstice felt like a good moment to finally finish it.
What interested me wasn’t whether frontier models could keep getting stronger. By the end of 2024, just about all digitizable human knowledge, most of what humanity has accumulated through writing and oral transmission, has already been fed into large language models. The most advanced models now are reasoning models. A crude way to think about them is that they internally “play out” several rounds of thought before answering. Extracting new knowledge beyond that may require reasoning systems that can interact with the external world directly: systems with eyes and hands, free to explore on Earth, the Moon, or Mars.
That upper bound isn’t what fascinates me most. I don’t even understand the mathematical proofs these systems spit out anymore. What I care about more is the lower bound: how much compute and how many parameters do you actually need to build something that can pass a Turing test about half the time?
Apple’s on-device models this year already give a rough answer. If what you want is an assistant with the equivalent of a basic modern education—something that can respond to schedule requests, summarize articles, or draft polite email replies—then the task appears to be manageable with roughly a 1B-parameter language model. That is basically the class of model Apple has embedded into newer phones. When a request gets more complicated, the system hands things off to a cloud model.
That also gives us a useful reference point for the rest of the field. Among open models, around 70B parameters is roughly the top end. In the cloud, systems with a few hundred billion parameters are also near the ceiling. For a long time, many people assumed that simply increasing parameter counts and building deeper networks would continue to unlock more capability. I still think that basic line of thinking is fine. The current problem is that we’re running out of high-quality training data. A lot of training data for today’s large models is already generated by other large models.
So if you look at it from the standpoint of human cognition, 70B parameters is probably already enough to reach something like university-student-level competence.
There’s another imperfect analogy here. Suppose I want to describe how a car accelerates. I only need two quantities: acceleration and mass. If I treat mass as constant, then in practice a single parameter—acceleration—is enough to describe the process. Once human knowledge gets abstracted and formalized mathematically, many concrete phenomena can be compressed into surprisingly few parameters.
That makes it tempting to think about the human brain in similar terms. A human brain has roughly 86 billion neurons. If you described each neuron with one independent parameter, then you would be talking about a model with around 86B parameters. As for how those parameters influence one another, that’s a lot like what we see in deep neural networks: at best we can inspect architecture and activation regions, but we are nowhere close to grasping the full physical and chemical process.
When a person is doing a specific language or cognition task, only a fraction of the whole system is actually in use. So if your goal is to build a language model with human-like behavior, it’s hard to imagine the parameter count needing to exceed 86B. A monkey without language has perhaps around 20B neurons, so maybe an optimized language model shouldn’t need to exceed 50B. An ant, by contrast, has only about 500,000 neurons, but a colony of 1,000 ants already gives you around 500 million (0.5B) interacting neurons, and a large colony of tens of millions of ants can reach several thousand or even tens of thousands of billions of interacting neurons.
We can comfortably understand division of labor inside a small ant nest, but a truly large colony may already exceed what our own intelligence can intuitively grasp. Maybe future reasoning models will tell us what kind of intelligence is hidden there. Biological analogies are sloppy, of course, but they can still help us guess an order of magnitude from the standpoint of degrees of freedom or model complexity. In a very loose sense, the more parameters an optimal model requires, the more difficult and complex the task it is solving. And the hardest problems we can encounter in turn define the upper end of the complexity we need.
What’s interesting is that the language models we currently find genuinely usable happen to sit right around the billion-parameter scale. Apple’s on-device model suggests that 1B corresponds roughly to the level of a person who has completed compulsory education. Then maybe 3B to 5B looks more like a high school graduate, and 7B to 13B starts to show something closer to a person with higher education. That analogy has plenty of holes. I’m not discussing architectures like MoE, and weird sizes like 3B or 13B often exist partly because they fit GPU memory constraints during training. Still, the rough pattern feels suggestive.
I know many phones in China have already deployed local models at around the 7B scale. That’s only the lower bound. My guess is that within a few years, on-device models will pass 13B, and when they do, people may get the same kind of shock they felt when they first encountered ChatGPT at the end of 2022. By then, language models may show up more naturally as agents—systems that can carry out chains of complex actions rather than just answer prompts. It would be a bit like the old transition from the desktop internet to the mobile internet.
Phones also have something language models have barely begun to exploit: sensors. Once a model can access those sensors, it gains a path to collecting new high-quality data from the real world. That may lead to more interesting developments later.
As for computers, running large models locally is already mature enough. Ten years ago I felt PC performance was already excessive, though that was mostly because I didn’t really play games. These days I mostly use handheld devices anyway. But from the perspective of local LLMs, modern computers are in a very practical sweet spot: if you’re willing to upgrade your hardware, you can run larger models and get better output, so we’re still not at the stage of true performance surplus.
My own machine is a 2021 MacBook Pro with an M1 Pro. Even though it’s a first-generation M-series chip, it can still run local models through Ollama without much trouble. A 7B model is easy, and it can even handle Qwen2.5 14B. Apple’s latest base-model Mac mini with the M4 chip probably lands in a similar range. Of course, if I actually wanted to deploy a local LLM server as a household assistant, I’d still prefer NVIDIA GPUs or one of the newer development boards. But for now I’m just experimenting. Everything I run locally is under 14B, and for the next five years that is probably the scale of most on-device models ordinary people will encounter, so it seems worth understanding their strengths and weaknesses now.
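As a rough sanity check on why these sizes fit on a laptop, here is a back-of-the-envelope estimate of how much memory the weights alone take. It is only a sketch: the function name is mine, the 4-bit figure assumes a typical quantized local build, and it ignores the KV cache, activations, and runtime overhead, so real usage will be somewhat higher.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Compare full 16-bit weights with an assumed 4-bit quantization.
for size in (1, 7, 14, 70):
    fp16 = weight_memory_gb(size, 16)
    q4 = weight_memory_gb(size, 4)
    print(f"{size:>3}B  16-bit: {fp16:6.1f} GB   4-bit: {q4:5.1f} GB")
```

Under these assumptions a 14B model needs about 7 GB for its weights at 4 bits, which is why it is within reach of a laptop with unified memory, while a 70B model at the same precision needs about 35 GB.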
Back to YuGPT. There are basically two ways to build something like this today.
The first is fine-tuning an existing model. My own machine could probably do it, barely, but there are ready-made Unsloth notebooks online, and free Google Colab resources are there to be used. In practice I trained on a paid A100, which was very fast.
The second is RAG—retrieval-augmented generation. In that setup, before the model answers your question, it first searches your own knowledge base, such as a pile of documents, finds relevant passages, and then synthesizes a response. A search engine that crawls webpages and then gives you a summary is basically one form of RAG.
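To make the retrieve-then-answer flow concrete, here is a toy sketch. It is not how a production RAG system works: plain word-overlap cosine similarity stands in for a real embedding model and vector database, the three documents are made up, and the final prompt would still have to be sent to a language model.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k documents most similar to the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: cosine(q, tokenize(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, passages: list[str]) -> str:
    """Put the retrieved passages in front of the question for the model."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the notes below.\n\n"
        f"Notes:\n{context}\n\nQuestion: {question}\n"
    )


# Made-up knowledge base, for illustration only.
documents = [
    "I spent most of the summer fishing and wrote very little.",
    "Small language models around 1B parameters can draft polite email replies.",
    "Ant colonies divide labor without any central control.",
]
question = "How many parameters does a model need to draft email replies?"
prompt = build_prompt(question, retrieve(question, documents, top_k=1))
print(prompt)
```

The point is the shape of the pipeline: search first, then hand the model only the passages that matched.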
As far as I know, one of the newer directions there is Microsoft’s GraphRAG, which retrieves via a knowledge graph instead of just a plain vector database. They also said a LazyGraphRAG version would be released for faster graph construction and retrieval. I tried setting up parts of this locally and ran into all sorts of issues, mainly because I didn’t want to rely on paid APIs and switched everything over to local models. Then a bunch of inexplicable problems appeared. So I decided to wait until LazyGraphRAG is out and try again.
Still, local RAG feels more like the practical application route for future on-device models. Automatic local fine-tuning feels like the sort of advanced move that probably won’t become common until model architectures change more substantially. These days, there isn’t much technical barrier to hosting a personal knowledge base on your own machine. For most people, the harder problem is not deployment—it’s having enough usable text to form the knowledge base in the first place.
This time I chose the fine-tuning route. I took 104 essays from my blog and uploaded them to Hugging Face. Then I hacked up an Unsloth fine-tuning notebook to feed in my text. The base model I chose was Qwen2.5 14B Instruct. Training took a little over two hours. On an A100, 100 epochs can be done in under an hour. Looking at the process, convergence was slow. Probably that’s because I don’t have a highly consistent writing style, or because the kinds of thoughts I write about have already been thoroughly represented in the broader training data of existing LLMs.
But after fine-tuning, if you ask something like “What does yufree think about this issue?” or “What does Yu Miao think about this issue?”, the model will actually answer. Before fine-tuning, it didn’t know who that was. As for output quality, I’d say about half of it is nonsense, and the other half vaguely resembles me. It’s still a toy.
There are plenty of ways to improve it. One approach would be to first use a local model to preprocess the original text—extract positions, arguments, and responses—so the training examples are better structured. I actually did try that. But the generated JSON files kept failing halfway through, usually because the language model ignored my prompt and output something other than JSON. So in the end I switched to using the articles directly as input.
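One common way to cope with a model that drifts away from JSON is to validate every reply and retry a few times before giving up. The sketch below shows that pattern under assumptions of my own: the three-key schema is hypothetical, and `generate` is a placeholder for whatever function actually calls the local model. It is not the preprocessing script described above.

```python
import json
from typing import Callable, Optional

# Hypothetical schema: the fields the preprocessing step would ask for.
REQUIRED_KEYS = {"position", "arguments", "responses"}


def parse_record(raw: str) -> Optional[dict]:
    """Return the parsed record if raw contains a JSON object with the required keys."""
    # Models often wrap JSON in extra prose; keep only the outermost {...} span.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        record = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict) or not REQUIRED_KEYS.issubset(record):
        return None
    return record


def extract_with_retry(
    article: str, generate: Callable[[str], str], max_attempts: int = 3
) -> Optional[dict]:
    """Ask the model for structured output, retrying when the reply is not valid."""
    prompt = (
        "Return only a JSON object with keys position, arguments, responses.\n\n"
        + article
    )
    for _ in range(max_attempts):
        record = parse_record(generate(prompt))
        if record is not None:
            return record
    return None  # the caller can skip the article or fall back to the raw text


# Stand-in for a real model call: the first reply ignores the instructions.
demo_replies = iter([
    "Sure! Here is my summary of the essay.",
    'Here you go: {"position": "p", "arguments": ["a"], "responses": []}',
])
print(extract_with_retry("essay text", lambda prompt: next(demo_replies)))
```

Returning `None` instead of raising keeps a long batch job running when a single article refuses to produce valid output.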
Another issue is that my writing tends to jump between ideas. A single article may touch on several topics, so breaking the training set into smaller chunks might help. The training set is also tiny, so a 14B model may simply not be getting enough signal in some areas.
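Splitting an article into smaller training examples could look something like the sketch below. It assumes paragraphs are separated by blank lines, and the 1,500-character budget is an arbitrary illustrative value, not something I tuned.

```python
def chunk_article(text: str, max_chars: int = 1500) -> list[str]:
    """Group consecutive paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole rather than
    being cut in the middle of a sentence.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


essay = "First idea.\n\nStill the first idea.\n\nA completely different topic."
for i, chunk in enumerate(chunk_article(essay, max_chars=40)):
    print(i, repr(chunk))
```

Chunking by paragraph rather than by fixed character count keeps each example readable, at the cost of uneven chunk sizes.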
If you want to try it yourself, install Ollama first, then run:
ollama run hf.co/yufree/Qwen2.5-14B-Instruct-bnb-4bit-yufree
If you don’t like asking questions through the command line, you can install any front end you want—AnythingLLM, LM Studio, LoraChat, and so on—and get something much closer to a ChatGPT-style interface.
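If you would rather script it than chat with it, Ollama also exposes a local HTTP API. The sketch below assumes Ollama is running on its default port 11434 and that the model has already been pulled with the command above. The helper names `build_request` and `ask` are mine.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "hf.co/yufree/Qwen2.5-14B-Instruct-bnb-4bit-yufree"


def build_request(prompt: str, model: str = MODEL) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the prompt and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

With the server running, `print(ask("What does yufree think about on-device language models?"))` should return a single reply, since streaming is turned off.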
What this really points to is that in the next few years, building another version of yourself is going to become easy. In a world already dealing with aging populations, that may not be entirely bad. A lot of people need companionship, but don’t necessarily want companionship from a generic machine. That makes the idea of licensed virtual personalities feel like a potentially interesting social phenomenon.
If someone has recorded enough of their life, then a kind of digital rebirth will probably become technically possible. Right now, what gets called digital resurrection usually means little more than making a photo move or generating a voice that sounds vaguely similar. But a future licensed virtual personality might be able to imitate a person’s actual way of thinking, not just their appearance or speech.
For ordinary people, that is not obviously good news. The ethical risks are high. If a virtual personality talks someone into committing a crime, things get awkward very quickly for the original person it was modeled after.
Maybe starting with our generation, it will become genuinely difficult for an individual to die completely. The body still has biological limits, of course. But styles of thinking, opinions, habits of expression—those can be reconstructed from the voice and text records people leave behind, assuming enough of those records exist.
People who keep extensive diaries or publish memoirs would probably be among the easiest to recreate as virtual personalities. Video creators and writers would also be relatively easy cases. In the future, you could even imagine a designed questionnaire or repeated human-machine interviews that gradually reconstruct a digital personality for people who want one.
The real question is not whether it can be done. It’s how many people would want to come back this way, and how many others would want them to.
I don’t mind if another version of me exists in the world.
The world might mind a lot more than I do.