
HanziGraph

HanziGraph is a Chinese/English dictionary and study tool for Chinese language learners. It represents the Chinese language as a graph, in which individual characters are the nodes, and the edges are words. As a concrete example, 确 and 定 would each represent a node, and 确定 would be the edge that connects them.
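
As a rough illustration of that structure, here's a minimal sketch in JavaScript; the field names and data format are hypothetical, not HanziGraph's actual schema:

// Hypothetical shape; HanziGraph's real data format may differ.
// Characters are nodes; a word like 确定 forms an edge between them.
const graph = {
  nodes: ['确', '定'],
  edges: [{ source: '确', target: '定', words: ['确定'] }],
};

// A character's neighbors are the characters it forms words with.
function neighbors(graph, char) {
  return graph.edges
    .filter((e) => e.source === char || e.target === char)
    .map((e) => (e.source === char ? e.target : e.source));
}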

Looking for Japanese? See the JapaneseGraph branch.

Demo

[Video: GraphDemo2024-02-17.mov]

Features

HanziGraph can also:

  • Demonstrate character composition with tree diagrams of components and compounds
  • Show sankey diagrams to demonstrate how words are used together
  • Demonstrate usage with example sentences (mostly human-generated, with some AI-generated 🤖)
  • Automatically generate spaced repetition flashcards, and track your study stats
  • Demonstrate word coverage with cumulative frequency graphs
  • Tokenize sentences into words
  • Work offline
  • Run in-browser text-to-speech to demonstrate pronunciation (see the sketch after this list)
  • Be installed as a standalone PWA
  • Use light or dark themes, based on OS preferences
  • And more!
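
As an example of one of those features: in-browser text-to-speech can be built on the standard Web Speech API. The sketch below shows the general idea; it is not necessarily the exact code HanziGraph uses.

// Minimal TTS sketch using the standard Web Speech API.
// HanziGraph's actual implementation may differ.
function speak(text, lang = 'zh-CN') {
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.lang = lang;
  // Prefer an installed voice that matches the requested language.
  const voice = speechSynthesis.getVoices().find((v) => v.lang === lang);
  if (voice) {
    utterance.voice = voice;
  }
  speechSynthesis.speak(utterance);
}

speak('确定');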

Interactive Graph

The nodes and edges (more than 100,000 in all) have data associated with them:

  • For hanzi (nodes), color coding based on tone (see the sketch after this list).
  • Usage frequency data, which can be used for color coding (red: very frequent; blue: less frequent) instead of tones; color coding by HSK level can be used in place of word frequency.
  • Definitions, from CEDICT (and in the case of Cantonese, CC-CANTO).
  • Human-generated example sentences, sorted by average word frequency, from Tatoeba.
  • For words not present in Tatoeba's corpus, AI-generated 🤖 examples are used instead.
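
As a sketch of the tone-based coloring, something like the mapping below could drive node colors; the palette here is made up, not the one HanziGraph ships:

// Hypothetical tone-to-color palette; the real colors may differ.
const toneColors = {
  1: 'red',
  2: 'orange',
  3: 'green',
  4: 'blue',
  5: 'gray', // neutral tone
};

// Given a character's tone number, pick its node color.
function nodeColor(tone) {
  return toneColors[tone] ?? 'black';
}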

Component Breakdowns

Each hanzi can also be diagrammed as a component tree, where each successive level of the tree is a further breakdown of its parent. Compounds using a character as a component are also listed. Colors indicate tones, and when pronunciation (pinyin initial, final, or both) is shared, the shared portion is shown on each connecting edge.

As an example, here's the breakdown for 恐 (kong3), showing how it shares its pinyin final and its tone with its component 巩.

[Image: component breakdown of 恐 (kong3)]
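
In data terms, a breakdown like the one above can be modeled as a recursive tree. A sketch, with hypothetical field names and further levels omitted:

// Hypothetical tree shape for the 恐 (kong3) breakdown shown above.
// 恐 splits into 巩 (gong3), which shares the final "ong" and the
// third tone, and 心 (xin1), which shares no pronunciation.
const kongTree = {
  char: '恐',
  pinyin: 'kong3',
  components: [
    { char: '巩', pinyin: 'gong3', shared: 'ong + tone 3', components: [] },
    { char: '心', pinyin: 'xin1', shared: null, components: [] },
  ],
};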

This can also help answer questions like "which simplified character has the largest number of transitive components?"

Answer: 蠮 (ye1)

[Image: component breakdown of 蠮 (ye1)]

This mode is also available as a standalone tool.

Word Relationships

In addition to character relationships expressed through a graph structure, the tool uses collocation data to show how words relate to one another. It expresses those relationships with sankey diagrams. These diagrams can also be thought of as a graph, where each node is a word and each edge is a collocation, with the edge weight representing frequency of use. One example would be 时候 commonly being preceded by 的. In this case, 时候 and 的 are nodes, and the weight of their connecting edge represents the frequency of the collocation 的时候.

[Image: example sankey diagram]
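
Conceptually, the collocation data behind these diagrams could look like the sketch below; the counts are invented for illustration:

// Illustrative collocation edges; the counts here are made up.
// Each entry says `first` is followed by `second` that many times.
const collocations = [
  { first: '的', second: '时候', count: 12000 }, // 的时候
  { first: '有', second: '时候', count: 2000 },  // 有时候
];

// Words that commonly precede a given word, most frequent first.
function precededBy(word) {
  return collocations
    .filter((c) => c.second === word)
    .sort((a, b) => b.count - a.count)
    .map((c) => c.first);
}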

Cumulative Frequency Stats

Curious how much bang for your buck you're getting by learning a given word? The frequency coloring and coverage graphs can help. The coverage graphs show what percentage of the language you'd understand if you learned every word, in order of frequency, up to your search term; very frequent words make up a disproportionate share of the spoken language.

[Image: cumulative frequency coverage graph]
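
The underlying arithmetic is straightforward; a sketch with invented corpus counts:

// Coverage sketch; the word counts here are invented.
// `wordCounts` is assumed sorted from most to least frequent.
const wordCounts = [
  { word: '的', count: 500000 },
  { word: '是', count: 300000 },
  { word: '时候', count: 50000 },
];

// Fraction of the corpus covered by learning every word down to `word`.
function coverageUpTo(word) {
  const total = wordCounts.reduce((sum, w) => sum + w.count, 0);
  let covered = 0;
  for (const w of wordCounts) {
    covered += w.count;
    if (w.word === word) {
      return covered / total;
    }
  }
  return null; // word not in the frequency list
}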

Flashcards

Flashcards can be created from the definitions and example sentences, and either studied in the tool or exported to Anki. The flashcards test both recognition (translating from Chinese to English) and recall (translating from English to Chinese); cloze (fill-in-the-blank) cards are also made. Because a new word is best studied in several contexts, up to 10 cards are made for a single word or character.
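
As a sketch, generating the three card types for one word might look like this; the card fields are hypothetical, and the real tool makes up to 10 cards per word:

// Hypothetical card shapes; HanziGraph's real schema may differ.
function makeCards(word, definition, sentence) {
  return [
    { type: 'recognition', front: word, back: definition },
    { type: 'recall', front: definition, back: word },
    { type: 'cloze', front: sentence.replaceAll(word, '____'), back: word },
  ];
}

makeCards('确定', 'to be sure; to determine', '我不确定他会不会来。');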

The tool also tracks study stats, including how many words and characters are in your flashcards and how many of those words fall into various frequency buckets.

[Video: StudyStatsDemo.mov]

Commands

In addition to searching by Chinese, English, or Pinyin, HanziGraph can run commands. Currently, the only supported command is:

!random <min_freq_rank || 0> <max_freq_rank || 10,000>

More may be added in the future (e.g., !measure <measure_word>).
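
A sketch of how the !random command might be parsed, applying the defaults shown above (illustrative only, not the site's actual parser):

// Illustrative parser for !random with default rank bounds.
function parseRandomCommand(input) {
  const parts = input.trim().split(/\s+/);
  if (parts[0] !== '!random') {
    return null; // not a recognized command
  }
  const min = parts[1] ? parseInt(parts[1], 10) : 0;     // default 0
  const max = parts[2] ? parseInt(parts[2], 10) : 10000; // default 10,000
  return { min, max };
}

parseRandomCommand('!random 500 2000'); // { min: 500, max: 2000 }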

As seen on...

You can learn more via discussions on reddit and on Hacking Chinese.

The tool was also recommended on the You Can Learn Chinese podcast. The Japanese version was recommended by Tofugu and The Japan Foundation, Sydney.

Running the code

HanziGraph is a static site hosted on Firebase. By default, there's no backend whatsoever, though users can sign in and sync their flashcards across devices (via client-side Firestore integration).

To run the static site with no Firebase dependencies, set USE_FIREBASE to false in main.js, then:

npm install && npm run build
cd public
python3 -m http.server

To run the Firebase version locally, set up your own Firebase project via their quickstart guide, replace the Firebase initialization code, and then run firebase emulators:start.

Better build automation is coming soon, which will allow simpler Firebase disablement and easier config substitution.

Note that some of the larger data files are partitioned to avoid excessive memory use or network bandwidth (while also avoiding huge numbers of files).
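
As an illustration of the partitioning idea, lookups can hash a word to one of a fixed number of files and fetch only that file; the scheme, paths, and partition count below are hypothetical:

// Hypothetical partitioning; the real file layout may differ.
const PARTITION_COUNT = 100;

function partitionFor(word) {
  let hash = 0;
  for (const ch of word) {
    hash = (hash * 31 + ch.codePointAt(0)) % PARTITION_COUNT;
  }
  return `data/definitions/${hash}.json`; // hypothetical path
}

// Only the partition containing the word is fetched on demand.
async function lookup(word) {
  const response = await fetch(partitionFor(word));
  const partition = await response.json();
  return partition[word];
}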

Code Deployment

  • Any pull request will deploy the proposed code change to a preview URL automatically, allowing manual testing.
  • Merges to the main branch are deployed to production automatically.
  • Additional deployment automation, particularly with end-to-end testing, is a future work item.

Project Status

The webapp is still a prototype, but it is functional and can be installed as a PWA or used on the web.

Note that consolidation of the code for the various datasets (Mandarin, Cantonese, Japanese) is ongoing. Upcoming changes will also begin using lit and pay down technical debt (there's uh...there's a lot of technical debt).

Acknowledgements

Sentence and definition data was pulled from:

  • Tatoeba, which releases data under CC-BY 2.0 FR
  • CEDICT, which releases data under CC BY-SA 4.0. Because of sharealike, the definitions files should be considered released under that license as well.
  • OPUS, specifically the OpenSubtitles, UN parallel, and WikiMatrix corpora.
  • OpenAI's gpt-3.5-turbo model generated example sentences for ~80,000 words and characters.
  • Japanese definitions were pulled from JMDict; links to their license terms are available on that page.

Character composition data was pulled from cjkvi-ids (specifically, the portion derived from the CHISE project, under their license) and cjk-decomp.

Cantonese frequency data was generated from a spreadsheet found on reddit, HKCanCor via pycantonese, and Tatoeba.

Jieba was used to tokenize sentences, including in Cantonese. It is also used in WASM form to tokenize sentences on the frontend.
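
Frontend tokenization might then look roughly like the sketch below; the package name and cut signature are assumptions about a jieba WASM binding, not verified against this repo's code:

// Assumed API: a jieba WASM binding exposing cut(text) -> string[].
// The actual package and signature used by HanziGraph may differ.
import { cut } from 'jieba-wasm';

const tokens = cut('我不确定他会不会来');
// e.g. ['我', '不', '确定', '他', '会不会', '来']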

CytoscapeJS and d3 are used for graph visualization. Some icons were based on CSS icons.