Hacking Chinese With High-Frequency Phrases

Over the last two weeks, I’ve received a number of enquiries from P1 mothers on how to help their children learn Chinese. Besides P1 being the first major milestone for children, the pandemic has also made it harder for children to use and learn Chinese organically.

My typical response is to focus on both oral/listening and vocabulary. If your child cannot understand what his or her primary school teacher is saying, Chinese lessons are going to be both useless and boring. Meanwhile, a strong vocabulary foundation makes understanding and using Chinese simple.

We’ve been on the “Vocabulary is the key” boat for a long time, which was what inspired us to create a Chinese clone of Wordle to help primary school students revise vocabulary daily. But what happens if a child hasn’t had that much preparation prior to P1?

Is there a way to smartly accelerate the process of vocabulary learning?

TLDR: High-frequency lists are a common way to learn a language, but most lists focus on characters (单字) rather than phrases (词语), which can be challenging for young children. We used Data Science to determine the 50 most common phrases (词语) that appear in the P1A textbook, spelling lists and assessment books.

Read on if you’re interested in our methodology, and here are the top ten most common phrases. By definition, high-frequency phrases will seem extremely simple to an adult since they are so commonplace. Yet this also makes revising them extremely high-value if your child has limited time.

1. 喜欢 (xǐ huɑn) - to like
2. 什么 (shén me) - what
3. 妈妈 (mā ma) - mother
4. 今天 (jīn tiān) - today
5. 爸爸 (bà ba) - father
6. 老师 (lǎo shī) - teacher
7. 星期 (xīng qī) - week
8. 哥哥 (gē ge) - older brother
9. 生日 (shēng rì) - birthday
10. 我们 (wǒ men) - we

80-20 Rule

One of my favourite principles is the 80-20 rule, which states that 80% of outcomes come from 20% of actions. 80% of exam questions come from 20% of the textbook material (hence the concept of “spotting questions”), 80% of a company’s revenue comes from 20% of its customers, etc.

So how can we identify the 20% of vocabulary that appears 80% of the time to “hack Chinese”?
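To make the 80-20 idea concrete for vocabulary, here is a minimal Python sketch that measures how much of a text the top few items account for. The function name `coverage_of_top` and the counts are made up purely for illustration and are not taken from our data.

```python
from collections import Counter


def coverage_of_top(counts, k):
    """Return the share of all occurrences covered by the k most common items."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(n for _, n in counts.most_common(k))
    return top / total


# Made-up counts for illustration only.
example_counts = Counter({"的": 6, "我": 2, "学": 1, "楼": 1})

# Share of the text covered by the single most common character.
share = coverage_of_top(example_counts, 1)
```

If a small `k` already gives a share close to 0.8, the list is behaving the way the 80-20 rule suggests.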

High-Frequency Characters

My first thought was to look for high-frequency characters. This is a common strategy for improving English vocabulary – by focusing on the most common words, kids are able to improve their reading dramatically.

High frequency words are a common strategy for learning English

So surely this should work for Chinese?

I did some research on Chinese high-frequency lists, and one caveat I found is that these lists depend heavily on the material they were derived from: a list built from formal learning texts would differ from a list built from storybooks, for example. That introduces the problem below.

The unique difficulties facing Singaporean kids

If you were born and raised in Singapore, you might not realise the peculiar challenges facing our children when they learn Chinese.

Singapore has one of the largest Overseas Chinese populations in the world and 70+% of our population is ethnically Chinese, yet the primary language used is English. Many of Singapore’s official Chinese textbooks, learning materials and standards are heavily influenced by China, yet the majority of our day-to-day reading is English.

Hence, materials created for native speakers in China and Taiwan are too difficult, while content meant for foreign learners is too simple. So how do we find a high-frequency list that is suitable for Singaporean kids?

Since we can code, the obvious first approach was to create our own high-frequency list based on the most common words that a P1 child would encounter.

Based on our experience, the typical P1 child mainly reads two things in Chinese: 1. the P1 textbook and 2. his or her Chinese homework (and yes, we agree this is a sad state of affairs). Thus, we created a text corpus by combining the P1A textbook with the thousands of P1 questions in our digital database, using appropriate weights. Lastly, we ran the text corpus through a frequency counter to derive the most common characters.

It’s extremely easy to create a frequency counter (this is for the textbook only)
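For readers who want to try this at home, a character frequency counter can be sketched in a few lines of Python. This is a minimal illustration rather than our production code: the function names, the sample sentences and the weights are all assumptions made up for the example.

```python
from collections import Counter


def count_characters(text):
    """Count Chinese characters, skipping punctuation, spaces and Latin letters."""
    # The range below is the main CJK Unified Ideographs block.
    return Counter(ch for ch in text if "\u4e00" <= ch <= "\u9fff")


def weighted_character_counts(sources):
    """Combine several (text, weight) pairs into one weighted count."""
    total = Counter()
    for text, weight in sources:
        for ch, n in count_characters(text).items():
            total[ch] += n * weight
    return total


# Made-up sample text standing in for the textbook.
textbook_sample = "我喜欢上学。我喜欢老师。"
counts = count_characters(textbook_sample)

# Hypothetical weights: here the textbook counts twice as much as homework.
combined = weighted_character_counts([("我喜欢", 2), ("我们", 1)])
```

Calling `counts.most_common(50)` on a real corpus would then give a top-50 character list.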

Houston, we have a problem

And so I happily showed my teachers the list of the 50 most common characters, only to sense a bit of reluctance.

“Dan, these characters might appear frequently, but some of these characters can be hard for young kids to visualise.”

– Teacher Jia Jia

Let’s look at the list of top 10 most common characters in the P1A textbook:

To an adult, characters like 什, 么, 是, 的 are obvious and it’s hard to imagine anyone struggling with them, but to a child, the easiest words to remember are concrete items, ideally things that they can see and touch, engaging multiple senses.

In addition, I neglected a key difference between English and Chinese. In English, words largely have distinct meanings, and combining different words typically does not change the meaning. In Chinese, each character is nuanced, and combining different characters together results in different meanings. Even 上楼 and 楼上 have different meanings, despite comprising the same characters.


Introducing chunking, a memory technique where we take individual pieces of information and group them into larger units to make remembering easier. For instance, it’s much easier to remember a phone number if we break it down into chunks (9120-62-62) than if we tried to remember the sequence 9-1-2-0-6-2-6-2.

Similarly, from experience, students tend to find it easier to remember Chinese phrases (词语) that consist of two characters rather than individual characters. In addition, phrases tend to have a specific meaning that doesn’t change based on the following character, making it easier for students to remember.

Approach #2 – High Frequency Phrases

Given the above, we decided to try an alternative approach of looking at high-frequency phrases instead of characters. While this made sense from a pedagogical perspective, it introduced coding challenges.

Given a sentence like “我喜欢上学”, determining how frequently a character appears is trivial since each character is a single unit: the characters “我”, “喜”, “欢”, “上”, “学” each appear once.

But how about phrases? A human could easily tell you that the above sentence contains “喜欢” and “上学”, but how would a computer know that an item like “欢上” is invalid? What about the phrase “喜欢上”?

One approach would be to have a human manually inspect and split every sentence into characters and phrases (e.g. converting the above to “我”, “喜欢”, “上学”), before using the frequency counter, though this would take an extremely long time.

I won’t bore you with exact implementation details, except to say that we created a reusable program that allows us to quickly determine high-frequency phrases from any body of text. We will be revisiting this topic in the near future because while high-frequency phrases will seem extremely simple to an adult, revising them is extremely high-value if your child has limited time.
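For the curious, one well-known way to split a sentence into phrases is dictionary-based "forward maximum matching": at each position, take the longest known phrase that fits, otherwise fall back to a single character. The sketch below illustrates that general technique and is not necessarily how our program works; the phrase dictionary `KNOWN_PHRASES` and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical mini-dictionary; a real one would hold thousands of phrases.
KNOWN_PHRASES = {"喜欢", "上学", "老师", "我们"}
MAX_PHRASE_LEN = max(len(p) for p in KNOWN_PHRASES)


def segment(sentence, phrases=KNOWN_PHRASES, max_len=MAX_PHRASE_LEN):
    """Split a sentence into known phrases and leftover single characters."""
    tokens = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, shrinking down to two characters.
        for size in range(min(max_len, len(sentence) - i), 1, -1):
            candidate = sentence[i:i + size]
            if candidate in phrases:
                tokens.append(candidate)
                i += size
                break
        else:
            # No known phrase starts here, so keep the single character.
            tokens.append(sentence[i])
            i += 1
    return tokens


def phrase_counts(sentences):
    """Count multi-character phrases across a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(t for t in segment(sentence) if len(t) > 1)
    return counts


tokens = segment("我喜欢上学")
counts = phrase_counts(["我喜欢上学", "我们喜欢老师"])
```

Because “欢上” is not in the dictionary, it is never produced as a phrase. The quality of the result depends heavily on the dictionary, and this simple greedy rule does not by itself settle genuinely ambiguous cases such as “喜欢上”.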

What did you think of our approach? Feel free to leave us a comment with your feedback and suggestions!

A non-technical guide to Artificial Intelligence

Artificial Intelligence or AI is the hottest buzzword in most industries, and education is no exception. In schools, MOE is working on an “AI-enabled adaptive learning system to support teaching and learning”, while at home, many parents use websites like KidStartNow’s Pet Battle or Koobits to revise intelligently.

But have you ever wondered what exactly AI is? In this post, we give parents a non-technical rundown of AI using an example everyone can appreciate – getting our kids to read more.


AI has notoriously many definitions, but I like IBM’s explanation that AI is using computers to mimic the problem-solving and decision-making capabilities of the human mind.

To the average person, AI means robots like Skynet in Terminator or J.A.R.V.I.S. in Iron Man – intelligent machines that are indistinguishable from the human mind. That is called Strong AI, and what most don’t realise is that we are still far from it. Rather, most AI applications today are Weak AI, which is focused on teaching machines to do a specific task like sweeping your floor or assessing your child’s Chinese pronunciation.

Pro-tip: Note that Weak AI does not mean that the AI does the task poorly, just that its intelligence is confined to a narrow scope. For instance, chess-playing AI is stronger than the best human players but it is considered a Weak AI as it’s only good at playing chess.

Strong AI
Weak AI

In today’s post, we will be exploring two broad kinds of AI – Expert Systems and Machine Learning, with the goal of building an intelligent system that can choose a good book for a 6-year-old girl to read.


An Expert System is an old-school AI system that emulates the decision-making ability of a human expert, typically through if-else rules.

So let’s say I’m the robot, and my wife is training me to go to the bookstore to select a book that is both educational and appealing to a 6-year-old girl.

My wife, being an expert on both shopping and what my daughter likes, could write down a list of rules that help me select the right book. For instance, I could

  1. Consider only books that are cheaper than $10, have pictures and do not have pinyin
  2. Reject books if they have more than two sentences per page or contain overly complex vocabulary (based on MOE syllabus)
  3. For each book, give 1 bonus point if it is about a topic my daughter likes (e.g. animals, princesses, fantasy). So a book with animals and princesses is worth 2 points.
  4. Select the book with the highest score. In the event of a tie, choose the cheapest book with the highest score.

Congratulations – we have just created a basic Expert System!
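For parents who code, the four rules above can be sketched as a small Python program. This is only an illustration of a rule-based system: the `Book` fields, the `complex_vocabulary` flag (assumed to have been checked against the syllabus elsewhere) and the sample books are all made up for the example.

```python
from dataclasses import dataclass, field

# Topics my daughter likes, taken from rule 3.
LIKED_TOPICS = {"animals", "princesses", "fantasy"}


@dataclass
class Book:
    title: str
    price: float
    has_pictures: bool
    has_pinyin: bool
    sentences_per_page: int
    complex_vocabulary: bool  # hypothetical flag, assumed to be set elsewhere
    topics: set = field(default_factory=set)


def is_eligible(book):
    """Rules 1 and 2: price, pictures, pinyin, sentence count and vocabulary."""
    return (
        book.price < 10
        and book.has_pictures
        and not book.has_pinyin
        and book.sentences_per_page <= 2
        and not book.complex_vocabulary
    )


def score(book):
    """Rule 3: one point for every liked topic the book covers."""
    return len(book.topics & LIKED_TOPICS)


def choose_book(books):
    """Rule 4: highest score wins, and the cheaper book wins a tie."""
    candidates = [b for b in books if is_eligible(b)]
    if not candidates:
        return None
    return max(candidates, key=lambda b: (score(b), -b.price))


# Made-up shelf of books for illustration.
shelf = [
    Book("Princess Pony", 8.0, True, False, 2, False, {"animals", "princesses"}),
    Book("Farm Friends", 6.0, True, False, 2, False, {"animals"}),
    Book("Dragon Castle", 5.0, True, True, 1, False, {"fantasy", "princesses", "animals"}),
    Book("Magic Kitten", 7.0, True, False, 2, False, {"fantasy", "animals"}),
]
pick = choose_book(shelf)
```

Notice that every decision is spelled out by a human in advance. The program never learns anything new, which is exactly what separates it from Machine Learning.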

At this point, you might go – “Dan, that doesn’t sound very intelligent”. But while rule-based systems are rudimentary, they work well for certain domains like education and healthcare.

For instance, the KidStartNow vocab revision app combines rule-based systems with forgetting curve models to track the words your child knows and the optimal set of questions to review.


Machine Learning is another kind of AI, and the cool kid on the block. It is basically teaching a computer to identify patterns from examples in data and make predictions (see the YouTube video below for a great explanation of what Machine Learning is).

Alright, let’s go back to the book selection example. What if my wife doesn’t actually know what sort of books our daughter likes – how should she train me to go to the bookstore to buy books?

One way would be to first show my daughter a list of books that we have at home, and for each book, ask if she likes it or not. After showing her enough books and recording her preferences, I will naturally gain an intuition of what she likes, which I can use to select a book with reasonable accuracy.

But wait, machines aren’t as smart as humans – we can’t simply tell a machine that my daughter likes Three Little Pigs, and have it automatically understand why. 

One thing we could do is associate each book with certain identifying features – for instance, a Three Little Pigs story would be a book about animals that has pictures, while the Frozen novel would be a book about princesses without pictures. This way, when we tell the machine that my daughter likes Three Little Pigs, it is able to start to reason “maybe she likes animal books with pictures”. And all we need to do is repeat the process with a large number of books (aka data).

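| Title                   | Category | Picture book | Daughter likes it? |
|-------------------------|----------|--------------|--------------------|
| Three Little Pigs       | Animals  | Yes          | Yes                |
| Three Little Pigs Novel | Animals  | No           | No                 |
| Cinderella Novel        | Princess | No           | Yes                |
| Frozen Novel            | Princess | No           | Yes                |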

Congratulations – we have just created a basic Machine Learning System that can predict what books my daughter will like!
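To show what "learning from examples" can look like in code, here is a minimal Python sketch using one of the simplest techniques around, nearest-neighbour voting: to guess whether my daughter will like a new book, find the known books that share the most features with it and go with their majority verdict. The training data comes from the table above, while the function names are assumptions for this example. Real systems use far more data and more sophisticated models.

```python
from collections import Counter

# Each entry: (features of the book, did my daughter like it?)
TRAINING_BOOKS = [
    ({"category": "Animals", "pictures": True}, True),     # Three Little Pigs
    ({"category": "Animals", "pictures": False}, False),   # Three Little Pigs Novel
    ({"category": "Princess", "pictures": False}, True),   # Cinderella Novel
    ({"category": "Princess", "pictures": False}, True),   # Frozen Novel
]


def similarity(a, b):
    """Count how many features two books have in common."""
    return sum(1 for key in a if a[key] == b.get(key))


def predict_like(new_book, examples=TRAINING_BOOKS):
    """Predict a like or dislike from the most similar known books."""
    scored = [(similarity(new_book, features), liked) for features, liked in examples]
    best = max(s for s, _ in scored)
    # Majority vote among every example tied for the best similarity.
    votes = Counter(liked for s, liked in scored if s == best)
    return votes[True] >= votes[False]


animal_novel = predict_like({"category": "Animals", "pictures": False})
princess_picture_book = predict_like({"category": "Princess", "pictures": True})
```

Unlike the Expert System, nobody wrote a rule saying "no animal novels". The prediction falls out of the examples, so adding more books to `TRAINING_BOOKS` changes what the program predicts.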


You might be wondering how an approach that sounds so simple could possibly work. The answer is data.

In his AI course, famous AI scientist Andrew Ng talks about how the rising amount of data, together with cheap computation power and improvement in algorithms, is powering rapid improvements in Machine Learning performance, especially in the field of Deep Learning.

Given sufficient amounts of good data, we can train machines to do very specific tasks like personalising a Spotify music playlist or predicting bank fraud. In the next section, we will talk specifically about how machine learning is used in the education space.

Machine Learning In Education

Speaking Mandarin is a big problem for many Singaporeans given that the majority of families now speak predominantly English at home. For many preschool parents, a concern is that their kids are speaking Chinese with an English or “ang-moh” accent. At KidStartNow, we are working on a machine learning audio pronunciation feature, where students can record and upload an audio clip, and our system can assess the accuracy of their pronunciation as well as their fluency and dictation.

Another use of machine learning in the education space is in universities, where AI can identify struggling students who are at risk of dropping out, so that officers can provide academic support. The way it works is that universities train a machine learning system with data from previous years, and it learns to predict at-risk students from information like attendance records, grades and socio-demographic information (the last of which is controversial).


While Artificial Intelligence has been extremely hyped over the last few years, we believe it has transformative potential in the education space, and hope this non-technical explanation has been helpful.

At KidStartNow, we believe that the secret to improving in Chinese is through effective revision – that’s why every time your child uses our vocabulary revision app, we track his or her progress, and then use AI to personalise an optimal learning plan. If you are interested in finding out more about our app or regular Chinese enrichment classes at Bedok, please leave your details below and we will contact you within 2 working days.