Is language visual? Experiments with Chinese characters



A broken printer was widely discussed on Douban, a Chinese social platform. The owner said that when the printer ran low on ink, it printed only the top half of each character. Yet the text was still perfectly easy to read.

Look at these three versions of 人工智能 (“artificial intelligence”).

Image by author: 4 characters with different crops

You can instantly read all three: the whole characters, the version with 80% retained, and the version with 50% retained. It’s not a trick. It’s probably something fundamentally rooted in the Chinese writing system.

Let me clarify one thing. The 80% and 50% refer to how much of the image is retained, not how much of each individual character. I simply crop the image horizontally at a fixed height, so keep in mind that each character occupies a different number of pixels in the image and loses a different share of its strokes.
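To make the cropping concrete, here is a minimal sketch in plain Python. It assumes the image is a list of pixel rows and that the top part is the one kept (as with the printer). The function name `crop_top` and the toy 10×4 image are hypothetical, not taken from the experiment code.

```python
def crop_top(image, keep_fraction):
    """Keep the top `keep_fraction` of the rows of a 2D grayscale image.

    `image` is a list of rows; each row is a list of gray levels (0..255).
    The cut is made at one fixed height for the whole image, so every
    character in the line is cut at the same row.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    keep_rows = max(1, round(len(image) * keep_fraction))
    return [row[:] for row in image[:keep_rows]]


# Hypothetical 10-row by 4-column image, just to show the shapes.
image = [[r * 10 + c for c in range(4)] for r in range(10)]
top_80 = crop_top(image, 0.8)
top_50 = crop_top(image, 0.5)
print(len(top_80), len(top_50))  # 8 5
```

Because the cut is a single row index for the whole image, a short character may lose almost nothing while a tall one loses a large part of its strokes.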

This got me thinking: is language, at least Chinese, fundamentally visual? I turned it over in my head for a few days and finally decided to investigate it the way I know how: by training some language models and seeing what actually happens.

Experiment: Pixel input, token output

Every language model has to deal with tokenization first. The basic idea is that computers cannot understand text, so they assign an ID, a number, to each word or character. For example, the character “你” might become 100, while another character becomes 3. From there, the LLM has to learn everything from scratch.
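As a toy illustration of this step, the sketch below builds a character-level vocabulary and encodes a string as integer IDs. It is not the tokenizer used in the experiments; the helper names `build_vocab` and `encode` are hypothetical, and the IDs depend only on the order in which characters are first seen.

```python
def build_vocab(text):
    """Assign each distinct character an integer ID in order of first appearance."""
    vocab = {}
    for ch in text:
        if ch not in vocab:
            vocab[ch] = len(vocab)
    return vocab


def encode(text, vocab):
    """Replace every character by its ID; the shape of the character is discarded."""
    return [vocab[ch] for ch in text]


corpus = "人工智能"
vocab = build_vocab(corpus)
ids = encode("智能人工", vocab)
print(ids)  # [2, 3, 0, 1]
```

The IDs are arbitrary labels: nothing about the number 2 says that 智 looks like or relates to any other character.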

In this sense, reducing characters such as 山 (mountain) and 水 (water) to plain integers means throwing away their form. And in Chinese characters, shape matters: stroke composition, radical components, and spatial layout all convey real information. Another example: 打 (hit), 拍 (pat), and 拉 (pull) all share the radical 扌 (hand). If you reduce them to IDs 423, 1089, and 2341, that relationship disappears.

So instead of a token ID, I rendered each character as a grayscale image and fed it to the language model. The model’s job was to predict the next character.
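The difference between the two input paths can be sketched as follows. This is only an illustration under stated assumptions: the visual encoder here is a single random linear layer over flattened pixels, which is a stand-in and not the encoder from the paper, and the diagonal "glyph" is a placeholder for a rendered character. All names (`lookup_embedding`, `pixel_embedding`, `dim`, `side`) are hypothetical.

```python
import random


def lookup_embedding(token_id, table):
    """Text path: the ID only selects a row; the glyph shape is never seen."""
    return table[token_id]


def pixel_embedding(image, weights):
    """Visual path: flatten the grayscale image and apply one linear layer."""
    pixels = [p / 255.0 for row in image for p in row]
    return [sum(w * p for w, p in zip(w_row, pixels)) for w_row in weights]


rng = random.Random(0)
dim, side = 16, 8  # assumed embedding size and image side length

# Randomly initialized parameters for both paths (before any training).
table = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(4)]
weights = [[rng.gauss(0.0, 0.1) for _ in range(side * side)] for _ in range(dim)]

# Placeholder glyph: a diagonal stroke standing in for a rendered character.
glyph = [[255 if r == c else 0 for c in range(side)] for r in range(side)]

text_vec = lookup_embedding(2, table)
vis_vec = pixel_embedding(glyph, weights)
print(len(text_vec), len(vis_vec))  # 16 16
```

Both paths produce a vector of the same size for the rest of the model. The difference is that two glyphs with overlapping strokes share input pixels, so their visual embeddings are related even with untrained weights, whereas two rows of a random lookup table are not.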

Good eyesight not required

If you’ve ever taken your glasses off to read a book, you know that you can still read blurry text. The same principle applies here.

Check out the 8×8 pixel version of 人工智能 (hold the screen at arm’s length).

Image by author: 8×8 pixel resolution with different cropping

Each character is 64 pixels. And a model trained with input at this resolution performs as well as a model trained with 80×80 images.

In fact, we tested image resolutions from 4×4 to 80×80 and found the following: going from 8×8 to 80×80, a 100x increase in pixels, makes essentially no difference.
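The 100x figure is just the pixel count ratio, 80·80 / (8·8) = 6400 / 64. The sketch below checks that and shows one simple way to go from a high-resolution image to a low-resolution one, by averaging non-overlapping blocks. Average pooling is an assumption made for illustration; the experiments may instead have rendered each character directly at the target size. The name `average_pool` and the test image are hypothetical.

```python
def average_pool(image, out_size):
    """Downsample a square grayscale image by averaging non-overlapping blocks."""
    in_size = len(image)
    if in_size % out_size != 0:
        raise ValueError("input size must be a multiple of the output size")
    block = in_size // out_size
    pooled = []
    for br in range(out_size):
        row = []
        for bc in range(out_size):
            total = sum(
                image[br * block + r][bc * block + c]
                for r in range(block)
                for c in range(block)
            )
            row.append(total / (block * block))
        pooled.append(row)
    return pooled


# Hypothetical 80x80 image: left half white (255), right half black (0).
big = [[255 if c < 40 else 0 for c in range(80)] for _ in range(80)]
small = average_pool(big, 8)

pixel_ratio = (80 * 80) // (8 * 8)
print(len(small), len(small[0]), pixel_ratio)  # 8 8 100
```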

The cropping results are even more striking. If 50% of each character is removed, accuracy drops by less than 2%. The model does not need a clear image of the whole character. It only needs enough structure to tell which radical family a character belongs to.

(Methodology note: In the examples above, I placed the full and cropped versions side by side for comparison. In the actual experiments, each training condition is completely independent. A model trained on cropped characters never sees the full versions.)

The hot start effect

So is the visual model better than the text-based one?

As it turns out, that’s not the case. Both converge to essentially the same final accuracy. But the journey, especially the beginning, looks very different.

After just 0.4% of the training steps, the visual model is already twice as accurate as the text-based baseline.

Image by author: Early dynamics

This is what we call the hot start effect. The visual model arrives at training already knowing something useful: that 打, 拍, and 拉 look alike and probably behave alike. The text-based model starts with random embeddings and has to learn this from scratch.

You can see this directly by looking at the embedding space during initialization (before training).

Even at this very early stage, characters that share a radical cluster together. Cosine similarity for radical-sharing pairs: ~0.27 for visual embeddings versus ~0.002 for random token embeddings.
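For readers who want to see why random embeddings score near zero, here is a small sketch of the cosine similarity computation. The vectors are toy data: two random Gaussian vectors stand in for untrained token embeddings, and two binary pixel vectors that share a block of "on" pixels stand in for characters with a common radical. The numbers it produces are illustrative only and are not the ~0.27 and ~0.002 reported above.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


rng = random.Random(0)
dim = 256  # assumed embedding size for the toy example

# Stand-ins for two randomly initialized token embeddings.
rand_a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
rand_b = [rng.gauss(0.0, 1.0) for _ in range(dim)]

# Stand-ins for two glyphs: the first 16 pixels are a shared "radical",
# the remaining 48 pixels differ between the two characters.
shared = [1.0] * 16
glyph_a = shared + [1.0 if i % 2 == 0 else 0.0 for i in range(48)]
glyph_b = shared + [1.0 if i % 3 == 0 else 0.0 for i in range(48)]

print("random pair:", round(cosine(rand_a, rand_b), 3))
print("shared-radical pair:", round(cosine(glyph_a, glyph_b), 3))
```

In high dimensions, independent random vectors are almost orthogonal, so their cosine similarity sits close to zero. Any shared structure in the input, such as a common radical, pushes the similarity well above that baseline before a single training step.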

Why the race ends in a draw

The key point is that the visual prior encodes visual similarity, not linguistic co-occurrence. And next-character prediction ultimately depends on the latter.

Yes, 打, 拍, and 拉 all share 扌 and look similar. But in real text they appear in very different contexts, such as 打击犯罪 (fighting crime), 拍摄照片 (taking pictures), and 拉动经济 (stimulating the economy). Once a text-based model has seen enough data to learn these patterns, the head start from the visual prior no longer matters.

In other words, visual input warm-starts the optimization, but the information ceiling stays the same.

Whenever I think about this, I’m reminded of Ted Chiang’s “Story of Your Life” (the movie Arrival was based on it). In the story, written and spoken language are two separate systems. But they ultimately serve the same purpose: communication. Two roads, same destination.

Where this actually matters

Even if the destination is the same, there are real-world situations where the route matters.

Low-resource settings. When you don’t have much training data, the visual head start turns into a practical advantage. In our experiments, with only 10,000 samples, the visual model already rivals the fully trained text baseline on a Chinese downstream benchmark (C-Eval).

Damaged historical documents. This one is exciting too. A visual model can help read classical Chinese manuscripts, damaged books, and handwritten documents with missing or faded strokes.

What about compute?

Good news: there’s very little overhead. The simple visual encoder I used actually has fewer parameters than the text baseline (12.6 million vs. 19 million). Memory overhead: +1.3%. So we argue that visual priors are nearly free.

The short answer

Is Chinese a visual language? The answer seems to be: At first, yes. In the end, it doesn’t matter.

The visual structure gives the model a hot start. This is similar to how a human reader sees 扌 and immediately knows the character has to do with hand-related actions. But the deeper patterns of language have to be learned from data, and both representations learn them in the same way.

The paper can be found on arXiv: https://arxiv.org/abs/2601.09566


