I recently spent some time seriously considering a local AI setup. The main idea was to build a private knowledge base and have an AI answer questions or draft articles from it. For example, I wanted a collection built from around 500 books, averaging roughly 400 pages each, focused on economics and sociology. That kind of use case still isn’t handled especially well by most public AI platforms.
At first I assumed local deployment would be straightforward. After a few days of digging into how it actually works, I ended up abandoning the idea and going back to online tools instead.

The real bottleneck is embedding, not the chat model
Like many people, I initially thought the most important part of a local AI stack was the inference model—the conversational model, the reasoning model, the model that actually talks back. Something like DeepSeek R1, for instance.
But for a knowledge-base workflow, the more important piece is usually the embedding side.
When an AI answers a question against a knowledge base, the chat model is not simply "remembering" the documents. It needs to search the knowledge base, rank relevant passages, and then use the most relevant material to generate an answer. If the documents were embedded poorly, retrieval quality falls apart. And once retrieval is off, the final answer can become irrelevant, fragmented, or simply wrong.
In practice, knowledge-base vectorization is not just one model. It should really be thought of as three parts working together:
- an embedding model
- a reranking model
- database management
Most people focus only on the embedding model itself. But poor vectorization quality is one of the biggest reasons self-built AI applications feel disappointing in actual use.
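To make the three parts concrete, here is a minimal sketch of the retrieve-then-rerank flow. The `embed` and `rerank` functions are toy stand-ins I made up for illustration (a bag-of-words vector and a word-coverage score), not how Qwen3-Embedding or any real reranker works, and the in-memory `index` list stands in for the database layer.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, index: list, top_k: int = 3) -> list:
    """Stage 1: cheap vector search over precomputed passage vectors."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [passage for passage, _ in ranked[:top_k]]


def rerank(query: str, passages: list, top_n: int = 1) -> list:
    """Stage 2: toy stand-in for a reranker, scoring query-word coverage."""
    q_terms = set(query.lower().split())

    def coverage(passage: str) -> float:
        return len(q_terms & set(passage.lower().split())) / len(q_terms)

    return sorted(passages, key=coverage, reverse=True)[:top_n]


passages = [
    "inflation erodes the purchasing power of money",
    "social capital shapes trust within communities",
    "central banks raise interest rates to fight inflation",
]
# The "database" layer: passages stored next to their vectors.
index = [(p, embed(p)) for p in passages]

query = "how do central banks fight inflation"
candidates = retrieve(query, index, top_k=2)
best = rerank(query, candidates, top_n=1)
print(best)
```

Only the passages that survive both stages would be handed to the chat model, which is why a weak first stage cannot be rescued later: a passage that is never retrieved is never reranked.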
Common mistakes when choosing embedding and reranking models
For local embedding, many people seem to default to the BGE 4B model because it gets recommended so often. After looking into it more carefully, I came to think Alibaba’s Qwen3-Embedding-8B is the better choice, and that it should ideally be paired with Qwen3-Reranker-4B.
Moving from a 4B embedding model to an 8B one can noticeably improve knowledge-base retrieval. By comparison, upgrading the reranker tends to have a smaller effect. So if resources are limited, prioritizing a stronger embedding model makes more sense. That said, the reranking model is still necessary.
Even so, a self-hosted setup built around an 8B embedding model plus a 4B reranker is, at best, usable—not truly good.
Online services still have obvious advantages. They are usually better at compressing the semantics of long text during embedding, better at cross-modal semantic alignment, and generally much more mature in how the database layer is handled. A simple example is document ingestion: a self-hosted setup may fail to correctly parse image-based tables in uploaded materials, or it may flatten them into plain text after recognition. Online systems, by contrast, often recognize the tables correctly and preserve them as tables, which can make retrieval far more accurate.
And this is really the core issue: because of personal hardware limits, it is basically impossible to reproduce locally the full strength of what online platforms provide. What those platforms offer is not just a model, but an entire service stack.
Local inference models run into hard limits quickly
Hardware constraints also make local inference far less capable than people often expect.
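For an individual user, a 14B local model is already on the upper end of what is realistic. But on consumer hardware, a 14B model can realistically handle only around 8,000 Chinese characters at a time, counting both input and output, because the context has to fit in whatever memory is left after the model weights are loaded. That is nowhere near enough for serious long-form generation.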
If your goal is to generate texts longer than 20,000 characters, you would realistically need something closer to a 70B model running locally. For most individuals, that is basically out of reach.
Online models are in a completely different league here. Producing 20,000 characters in one go is relatively easy for them. In my own testing, Doubao was able to generate an article of more than 40,000 characters, and the quality was actually quite solid. The test topic was "On How Stablecoins May Reshape the Global Monetary Order and Affect the Internationalization of the Renminbi". I only provided the prompt and organized the outline; Doubao generated the full body of the article. Only the abstract at the beginning was written manually.
In that test, Doubao performed well in three areas at once:
- output length
- factual accuracy within the article
- internal logic and coherence across sections
I also tried a few other models:
- Tencent Hunyuan only produced what was essentially an outline. It looked like roughly 1,000 words, yet the model claimed it was 12,000.
- DeepSeek R1 generated roughly 10,000 words, but the result felt like a pile of separate blocks rather than a unified article. Transitions between sections were unnatural.
- Tongyi Qwen Max latest from Alibaba, often described as especially strong for Chinese reasoning and professional writing, performed about on par with DeepSeek R1 and better than Hunyuan, but still not as good as Doubao in this test.
Deployment is only the beginning: tuning and model unloading matter
Even after the models are running, the work is not finished.
Embedding models need tuning: chunk size, overlap size, similarity thresholds, and so on. Inference models also need tuning: context length, temperature, and other generation settings. These do affect answer quality.
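As an illustration of what chunk size and overlap actually control, here is a minimal character-based splitter. It is a sketch under simple assumptions: the function name `chunk_text` and its defaults are mine, and real pipelines usually split on sentence or paragraph boundaries rather than raw character counts.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of `chunk_size` characters.

    Consecutive chunks share `overlap` characters, so a sentence cut at a
    chunk boundary still appears whole in at least one neighbouring chunk
    (as long as it is shorter than the overlap).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
```

Because it counts characters rather than words, the same function works on Chinese text. Larger chunks give the embedding model more context per vector but make each hit less precise; more overlap reduces boundary losses at the cost of a bigger index.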
But the most important tuning issue is often not answer quality at all—it is memory management.
In real local deployment, you need to think about things like:
- slicing and segmentation
- KV cache handling
- model unloading
The main goal is to avoid running out of VRAM.
If you deploy an inference model, an embedding model, and a reranking model together, even the inference model alone may already risk exhausting GPU memory. If two or three models are active at the same time, VRAM overflow becomes very likely.
That is where model unloading comes in.
Here, "unloading" does not mean uninstalling the model. It means shutting it down temporarily to release VRAM. In practice, both unloading and the broader tuning workflow are best handled through Python-based automation. But configuring that environment is not trivial, and for many people it becomes a serious barrier to entry—or a reason to give up entirely.
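For a sense of what that automation can look like, here is a minimal sketch that assumes the models are served by Ollama on its default local address. Ollama documents a `keep_alive` parameter on its `/api/generate` endpoint, and a value of 0 asks it to drop the model from memory right away. The model tag `qwen3:14b` is only an example, and whether the same call behaves identically for embedding or reranking models is an assumption you should verify on your own setup.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local address


def build_unload_payload(model: str) -> dict:
    """Request body asking Ollama to release a model immediately."""
    return {"model": model, "keep_alive": 0}


def unload_model(model: str, base_url: str = OLLAMA_URL, timeout: float = 30.0) -> int:
    """Send the unload request and return the HTTP status code."""
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(build_unload_payload(model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Example (requires a running Ollama server):
# unload_model("qwen3:14b")
```

The model files stay on disk; only the copy held in memory is released, so the next request simply pays the loading time again.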
Model upgrades create an ongoing maintenance burden
Suppose the environment is finally configured and everything works. That still is not the end of the story.
Models are being updated quickly. Roughly every six months, a major new generation tends to appear. If the main point of building a local AI system is to maintain a private knowledge base, then a major upgrade to the embedding model will probably force you to reconfigure the setup and regenerate the knowledge base.
That is a lot of work.
And it creates an awkward tradeoff: you built the local knowledge base for convenience, but now you may have to rebuild it every half year. After regeneration, the accumulated interaction habits and practical prompting patterns from previous use may also no longer carry over cleanly. Under those conditions, the value of self-hosting starts to look much weaker.
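The reason regeneration is unavoidable is that vectors produced by different embedding models live in different spaces and cannot be compared with each other. One hedged way to at least detect the problem is to store the model identity next to the index and check it before querying. The names below (`IndexMetadata`, `needs_rebuild`, and the model labels) are hypothetical, chosen only for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexMetadata:
    """What was used to build the stored vectors (hypothetical schema)."""
    embedding_model: str
    dimension: int


def needs_rebuild(stored: IndexMetadata, current: IndexMetadata) -> bool:
    """True when the index was built with a different model or vector size."""
    return stored != current


stored = IndexMetadata(embedding_model="embed-v1", dimension=1024)   # made-up values
current = IndexMetadata(embedding_model="embed-v2", dimension=1024)  # made-up values

if needs_rebuild(stored, current):
    print("Embedding model changed: re-embed every document before querying.")
```

Note that a matching dimension is not enough. Two models can emit vectors of the same length and still be incompatible, which is why the model name is part of the comparison.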
If local deployment is a must, here is the practical advice
Despite all of these drawbacks, local deployment will still be a hard requirement for some people for various reasons. If the goal is a knowledge base plus long-form writing, this is the setup I would lean toward.
If you have more than 16 GB of local memory
Use Cherry Studio as the local interface.
For the inference model, call Qwen Max latest through Alibaba Cloud’s API.
For retrieval:
- use Qwen3-Embedding-8B for embeddings
- use Qwen3-Reranker-4B for reranking
Deploy local models through Ollama.
If the inference model must also be local, then Qwen3-14B or DeepSeek R1 14B would be the recommended options.
If you have less than 12 GB of local memory
Still use Cherry Studio.
For models:
- inference: Qwen3-8B
- embedding: Qwen3-Embedding-8B or Qwen3-Embedding-4B
- reranking: Qwen3-Reranker-4B
My own intuition is that if the local inference model is smaller than 32B, local deployment starts to lose much of its meaning, especially now that API pricing keeps getting cheaper.
Whatever the setup, memory management has to be strict. The inference, embedding, and reranking models should not run simultaneously. One should finish, release memory, and only then should the next one start.
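One way to enforce the one-at-a-time rule in code, rather than relying on discipline, is a context manager that always releases a model when its stage ends, even if that stage fails. This is a sketch: `load` and `unload` are hypothetical callables you would wire to your own runtime, and here they only record events so the ordering is visible.

```python
from contextlib import contextmanager


@contextmanager
def exclusive_model(name, load, unload):
    """Load a model for one stage and always release it afterwards."""
    load(name)
    try:
        yield name
    finally:
        unload(name)  # runs even if the stage raises an exception


events = []


def fake_load(name):
    events.append(f"load {name}")


def fake_unload(name):
    events.append(f"unload {name}")


# Stages run strictly one after another; no two models are resident at once.
for stage in ("embedding", "reranker", "inference"):
    with exclusive_model(stage, fake_load, fake_unload):
        events.append(f"run {stage}")

print(events)
```

The cost of this design is reload time at every stage, which is the price of fitting three models into memory that can only hold one.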
A suggestion I received from DeepSeek was to set the LLM context window to 4096 (ctx-size = 4096), because the default of 8192 may trigger an out-of-memory (OOM) error.
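The reason a smaller context window helps is that the KV cache grows linearly with context length. A rough estimate, assuming a standard transformer with one key and one value tensor per layer, 16-bit cache values, and a single sequence, is sketched below. The model shape is a made-up example, not the configuration of any specific model, and real runtimes allocate additional buffers on top of this. In Ollama the corresponding option is named `num_ctx`.

```python
def kv_cache_bytes(ctx_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size for one sequence.

    The leading factor of 2 counts one key tensor plus one value tensor
    per layer; bytes_per_value=2 corresponds to 16-bit cache entries.
    """
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_value


# Hypothetical model shape, for illustration only.
SHAPE = dict(n_layers=40, n_kv_heads=8, head_dim=128)

for ctx in (4096, 8192):
    gib = kv_cache_bytes(ctx, **SHAPE) / 2**30
    print(f"ctx={ctx}: {gib:.3f} GiB of KV cache")
```

With these illustrative numbers the cache doubles from 0.625 GiB to 1.25 GiB when the context goes from 4096 to 8192. On a card where the weights already fill most of the memory, that difference can decide whether the model loads at all.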
DeepSeek also suggested a hybrid architecture: keep both online and local inference available, and route requests automatically based on output size. If the required output exceeds the local model’s threshold—for example, more than 4K—then hand the task off to the online inference model.
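The routing rule itself is simple enough to sketch. I am assuming here that "4K" means 4,096 tokens and that the caller can estimate the required output length up front; `choose_backend` and the backend labels are hypothetical names, not part of any tool mentioned above.

```python
LOCAL_OUTPUT_LIMIT = 4096  # assumed threshold, in tokens


def choose_backend(expected_output_tokens: int,
                   local_limit: int = LOCAL_OUTPUT_LIMIT) -> str:
    """Route to the online model only when the local one cannot cope."""
    if expected_output_tokens < 0:
        raise ValueError("expected_output_tokens must be non-negative")
    return "online" if expected_output_tokens > local_limit else "local"


print(choose_backend(1500))   # short answer stays local
print(choose_backend(20000))  # long-form article goes online
```

In practice the hard part is the estimate, not the rule. A simple starting point is to let the user state the target length explicitly and fall back to the local model when no length is given.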
Why online tools make more sense for most people
A lot of online AI products now advertise 128K output, which is already enough for generating articles longer than 20,000 characters. That is far beyond what a local 14B model can realistically do. Doubao reportedly supports 256K output, which translates to more than 40,000 Chinese characters.
As models continue to improve, long-form generation will only become easier, output quality will keep rising, and API costs will likely continue falling. If that trend holds, the case for local deployment becomes weaker and weaker for ordinary users.