Hacking Chinese With High-Frequency Phrases

Over the last two weeks, I’ve received a number of enquiries from P1 mothers on how to help their children learn Chinese. Besides P1 being the first major milestone for children, the pandemic has also made it harder for children to use and learn Chinese organically.

My typical response is to focus on both oral/listening and vocabulary. If your child cannot understand what his or her primary school teacher is saying, Chinese lessons are going to be both useless and boring. Meanwhile, a strong vocabulary foundation makes understanding and using Chinese simple.

We’ve been on the “Vocabulary is the key” boat for a long time, which was what inspired us to create a Chinese clone of Wordle to help primary school students revise vocabulary daily. But what happens if a child hasn’t had that much preparation prior to P1?

Is there a way to smartly accelerate the process of vocabulary learning?

TLDR: High-frequency lists are a common way to learn a language, but most lists focus on characters (单字) rather than phrases (词语), which can be challenging for young children. We used data science to determine the 50 most common phrases (词语) that appear in the P1A textbook, spelling lists and assessment books.

Read on if you’re interested in our methodology. Here are the top ten most common phrases. By definition, high-frequency phrases will seem extremely simple to an adult since they are so commonplace. Yet this also makes revising them extremely high-value if your child has limited time.

1. 喜欢 (xǐ huɑn) - to like
2. 什么 (shén me) - what
3. 妈妈 (mā ma) - mother
4. 今天 (jīn tiān) - today
5. 爸爸 (bà ba) - father
6. 老师 (lǎo shī) - teacher
7. 星期 (xīng qī) - week
8. 哥哥 (gē ge) - older brother
9. 生日 (shēng rì) - birthday
10. 我们 (wǒ men) - we

80-20 Rule

One of my favourite principles is the 80-20 rule, which states that 80% of outcomes come from 20% of actions. 80% of exam questions come from 20% of the textbook material (hence the concept of “spotting questions”), 80% of a company’s revenue comes from 20% of its customers, etc.

So how can we identify the 20% of vocabulary that appears 80% of the time to “hack Chinese”?
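To make the 80-20 idea concrete, here is a minimal sketch that asks: given a table of how often each item appears, how many of the most frequent items do we need to cover 80% of all occurrences? The function name and the toy counts are made up purely for illustration; they are not figures from our corpus.

```python
from collections import Counter


def items_needed_for_coverage(counts: Counter, target: float = 0.8) -> int:
    """Return how many of the most frequent items account for `target` of all occurrences."""
    total = sum(counts.values())
    covered = 0
    # Walk from the most frequent item downwards, accumulating occurrences.
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= target:
            return rank
    return len(counts)


# Toy data, invented for illustration only (the counts sum to 100).
toy_counts = Counter({"的": 40, "我": 25, "是": 15, "妈妈": 10, "学校": 5, "生日": 3, "星期": 2})
print(items_needed_for_coverage(toy_counts))  # 3
```

In this toy example, 3 of the 7 items already cover 80% of occurrences. Real text will have a different ratio, but the shape of the calculation is the same.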

High-Frequency Characters

My first thought was to look for high-frequency characters. This is a common strategy for improving English vocabulary: by focusing on the most commonly used words, kids are able to improve their reading dramatically.

High frequency words are a common strategy for learning English

So surely this should work for Chinese?

I did some research on Chinese high-frequency lists, and one caveat I found is that these lists depend heavily on the material they were derived from. A list built from formal learning texts would differ from a list built from storybooks, for example. That introduces the problem below.

The unique difficulties facing Singaporean kids

If you were born and raised in Singapore, you might not realise the peculiar challenges facing our children when they learn Chinese.

Singapore has one of the largest Overseas Chinese populations in the world and 70+% of our population is ethnically Chinese, yet the primary language used is English. Many of Singapore’s official Chinese textbooks, learning materials and standards are heavily influenced by China, yet the majority of our day-to-day reading is English.

Hence, materials created for native speakers in China and Taiwan are too difficult, while content meant for foreign learners is too simple. So how do we find a high-frequency list that is suitable for Singaporean kids?

Since we can code, the obvious first approach was to create our own high-frequency list based on the most common words that a P1 child would encounter.

Based on our experience, the typical P1 child mainly reads two things in Chinese: 1. the P1 textbook and 2. his or her Chinese homework (and yes, we agree this is a sad state of affairs). Thus, we created a text corpus by combining the P1A textbook with the thousands of P1 questions in our digital database, using appropriate weights. Lastly, we ran the text corpus through a frequency counter to derive the most common characters.

It’s extremely easy to create a frequency counter (this is for the textbook only)
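For readers who want to try this at home, a character frequency counter can be sketched in a few lines of Python. This is a minimal illustration rather than our exact code: the function names, the sample sentence and the weights are all assumptions, and the Unicode range check only covers the main block of Chinese characters.

```python
from collections import Counter


def is_chinese_char(ch: str) -> bool:
    """True for characters in the main CJK Unified Ideographs block (U+4E00 to U+9FFF)."""
    return "\u4e00" <= ch <= "\u9fff"


def char_frequencies(text: str) -> Counter:
    """Count each Chinese character, ignoring punctuation, spaces and Latin letters."""
    return Counter(ch for ch in text if is_chinese_char(ch))


def weighted_frequencies(sources: list[tuple[str, int]]) -> Counter:
    """Combine several texts, multiplying each text's counts by a (hypothetical) weight."""
    total = Counter()
    for text, weight in sources:
        for ch, n in char_frequencies(text).items():
            total[ch] += n * weight
    return total


# Made-up sample sentence, not taken from the textbook.
sample = "我喜欢上学。我喜欢老师。"
print(char_frequencies(sample).most_common(3))  # [('我', 2), ('喜', 2), ('欢', 2)]
```

The weighted version is one simple way to let, say, the textbook count for more than a worksheet. How large the weights should be is a judgment call.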

Houston, we have a problem

And so I happily showed my teachers the list of the 50 most common characters, only to sense a bit of reluctance.

“Dan, these characters might appear frequently, but some of these characters can be hard for young kids to visualise.”

– Teacher Jia Jia

Let’s look at the list of top 10 most common characters in the P1A textbook:

To an adult, characters like 什, 么, 是, 的 are obvious and it’s hard to imagine anyone struggling with them. To a child, however, the easiest words to remember are concrete items, ideally things they can see and touch, engaging multiple senses.

In addition, I neglected a key difference between English and Chinese. In English, words largely have distinct meanings, and combining different words typically does not change the meaning. In Chinese, each character is nuanced, and combining different characters together results in different meanings. Even 上楼 (to go upstairs) and 楼上 (upstairs) have different meanings, despite comprising the same characters.


Introducing chunking, a memory technique where we take individual pieces of information and group them into larger units to make remembering easier. For instance, it’s much easier to remember a phone number if we break it down into chunks (9120-62-62) than if we tried to remember the sequence 9-1-2-0-6-2-6-2.

Similarly, from experience, students tend to find it easier to remember Chinese phrases (词语) that consist of two characters rather than individual characters. In addition, a phrase tends to have a specific meaning that doesn’t change based on the following character, making it easier for students to remember.

Approach #2 – High Frequency Phrases

Given the above, we decided to try an alternative approach of looking at high-frequency phrases instead of characters. While this made sense from a pedagogical perspective, it introduced coding challenges.

Given a sentence “我喜欢上学”, determining how frequently a character appears is trivial since each character is a single unit. So the characters “我”, “喜”, “欢”, “上”, “学” each appear once.

But how about phrases? A human could easily tell you that the above sentence contains “喜欢” and “上学”, but how would a computer know that an item like “欢上” is invalid? What about the phrase “喜欢上”?
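To see the problem in code, here is a small sketch of the naive approach: slide a two-character window across the sentence and treat every pair as a candidate phrase. The function name is just for illustration.

```python
def two_char_windows(sentence: str) -> list[str]:
    """Return every pair of adjacent characters in the sentence."""
    return [sentence[i:i + 2] for i in range(len(sentence) - 1)]


print(two_char_windows("我喜欢上学"))
# ['我喜', '喜欢', '欢上', '上学']
```

The real phrases “喜欢” and “上学” are in there, but so are “我喜” and “欢上”, which are not phrases at all. Counting every window would fill the list with this kind of noise.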

One approach would be to have a human manually inspect and split every sentence into characters and phrases (e.g. converting the above to “我”, “喜欢”, “上学”) before using the frequency counter, though this would take an extremely long time.

I won’t bore you with exact implementation details, except to say that we created a reusable program that allows us to quickly determine high-frequency phrases from any body of text. We will be revisiting this topic in the near future because while high-frequency phrases will seem extremely simple to an adult, revising them is extremely high-value if your child has limited time.
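For the curious, here is a minimal sketch of one well-known way to split Chinese text automatically, called forward maximum matching: at each position, take the longest run of characters that appears in a list of known phrases, and fall back to a single character otherwise. This is an illustration only and not the program described above. The function names and the tiny phrase list are assumptions made up for the example.

```python
from collections import Counter


def segment(sentence: str, phrases: set[str], max_len: int = 4) -> list[str]:
    """Split a sentence greedily, preferring the longest known phrase at each position."""
    tokens = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, shrinking until a known phrase is found.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            # A single character is always accepted as the fallback.
            if size == 1 or candidate in phrases:
                tokens.append(candidate)
                i += size
                break
    return tokens


def phrase_frequencies(sentences: list[str], phrases: set[str]) -> Counter:
    """Count multi-character tokens across all sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(t for t in segment(sentence, phrases) if len(t) > 1)
    return counts


# Hypothetical phrase list; a real one would need to be far larger.
known_phrases = {"喜欢", "上学", "老师", "妈妈"}
print(segment("我喜欢上学", known_phrases))  # ['我', '喜欢', '上学']
```

A greedy split like this is only as good as its phrase list. If “喜欢上” were in the list, the same sentence would come out as “我”, “喜欢上”, “学”, which is exactly the kind of ambiguity raised earlier.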

What did you think of our approach? Feel free to leave us a comment with your feedback and suggestions!
