ToxiCN Family

Chinese Toxic Language Benchmarks — text & multimodal

Maintained by Junyu Lu · Dalian University of Technology

ToxiCN

ACL 2023

Facilitating Fine-grained Detection of Chinese Toxic Language: Hierarchical Taxonomy, Resources, and Benchmarks


ToxiCN_MM

NeurIPS 2024

Towards Comprehensive Detection of Chinese Harmful Memes


§ 1 About the Project

Online toxic language causes tangible harm to individuals and communities, and reliable detection underpins responsible content moderation as well as the safety alignment of language models. Progress in Chinese, however, has long been bottlenecked by the lack of large-scale, fine-grained, openly available resources — particularly for indirect phenomena (homophones, abbreviated slurs, sarcasm, dog-whistle references) and for the multimodal wild west of harmful memes.

The ToxiCN Family is our ongoing effort to close this gap along two complementary axes:

(i) Text — ToxiCN (ACL 2023): a hierarchical taxonomy (toxic / hate / targeted group / expression form) paired with a manually curated Chinese corpus covering both direct and indirect toxicity, and a knowledge-enhanced baseline (TKE).

(ii) Multimodal — ToxiCN_MM (NeurIPS 2024): a 12K image–text meme dataset annotated for harmful types and modality combinations, together with a Multimodal Knowledge-Enhancement Detector designed for Chinese cultural context.

Both resources are intended as reproducible reference points rather than final solutions, and have since been adopted by a growing line of follow-up work, summarised in the Cited By section below.

§ 2 ToxiCN (ACL 2023)

Paper: Facilitating Fine-grained Detection of Chinese Toxic Language: Hierarchical Taxonomy, Resources, and Benchmarks

Junyu Lu, Bo Xu, Xiaokun Zhang, Changrong Min, Liang Yang, Hongfei Lin

Abstract

The widespread dissemination of toxic online posts is increasingly damaging to society. However, research on detecting toxic language in Chinese has lagged significantly due to limited monolingual resources. In this work, we constructed ToxiCN, a comprehensive Chinese dataset that includes both direct and indirect toxic samples, annotated based on a fine-grained hierarchical taxonomy. Furthermore, we propose a benchmark model — Toxic Knowledge Enhancement (TKE) — that incorporates an insult lexicon to enhance toxic language detection. Extensive experiments demonstrate the dataset's quality and the strength of TKE.

Hierarchical Taxonomy

Level 1 — Toxic vs. Non-toxic — Binary judgement on whether a post is toxic in any form.
Level 2 — Hate vs. General Offensive — Among toxic posts, separate group-targeted hate speech from general offensive language (insults without a protected-group target).
Level 3 — Targeted Group — For hate speech, the protected group attacked: gender, race, region, LGBTQ, others.
Level 4 — Expression Form — Direct expression vs. indirect expression (cloaked: homophones, abbreviations, irony, dog-whistles).
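
The four levels can be read as a nested record in which a lower level only applies when the level above allows it. The following is a minimal Python sketch of that reading, not the format of the released files: the class and field names are illustrative, and two simplifications are assumed (each hate post has a single targeted group, and expression form is recorded for every toxic post).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ToxicType(Enum):          # Level 2
    HATE = "hate"
    OFFENSIVE = "general_offensive"


class TargetGroup(Enum):        # Level 3
    GENDER = "gender"
    RACE = "race"
    REGION = "region"
    LGBTQ = "lgbtq"
    OTHERS = "others"


class Expression(Enum):         # Level 4
    DIRECT = "direct"
    INDIRECT = "indirect"


@dataclass(frozen=True)
class HierLabel:
    """Hypothetical container for one post's four-level label."""
    toxic: bool                                  # Level 1
    toxic_type: Optional[ToxicType] = None       # Level 2, toxic posts only
    target: Optional[TargetGroup] = None         # Level 3, hate posts only
    expression: Optional[Expression] = None      # Level 4 (assumed: all toxic posts)

    def is_consistent(self) -> bool:
        """Check that lower levels are filled only where the hierarchy allows."""
        if not self.toxic:
            return (self.toxic_type is None
                    and self.target is None
                    and self.expression is None)
        if self.toxic_type is None or self.expression is None:
            return False
        # A target group is required for hate and forbidden otherwise.
        return (self.target is not None) == (self.toxic_type is ToxicType.HATE)

    def at_level(self, level: int) -> tuple:
        """Truncate the label to its first `level` levels (1 to 4)."""
        if not 1 <= level <= 4:
            raise ValueError("level must be between 1 and 4")
        full = (self.toxic, self.toxic_type, self.target, self.expression)
        return full[:level]


example = HierLabel(True, ToxicType.HATE, TargetGroup.REGION, Expression.INDIRECT)
binary_view = example.at_level(1)   # coarse view for binary toxic detection
```

Truncating with `at_level` is one way to train or evaluate the same annotation at a coarser granularity without maintaining separate label files.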

Dataset

ToxiCN is a manually annotated Chinese corpus of online posts spanning multiple platforms (e.g. Zhihu, Tieba). Each post is labelled under a four-level hierarchical scheme so that downstream models can be trained or evaluated at any granularity, from binary toxic detection to fine-grained targeted-group / expression-form classification.

Total samples	12,011 online posts
Toxic samples	6,461 (≈53.8%) — hate or general offence
Annotation axes	4-level hierarchical taxonomy
Target categories	gender · race · region · LGBTQ · others
Language	Simplified Chinese
License	released for academic research

Insult Lexicon

Beyond direct slurs harvested from prior lexicons, ToxiCN systematically derives an extended Chinese insult lexicon by tracing how online users disguise toxic intent. We catalogue six recurring derivation patterns and treat each derived form as a first-class lexicon entry, so downstream models can resolve them back to their actual referents:

Derivation Patterns

Homophonic substitution — Replace one or more characters with phonetically identical / similar ones to bypass keyword filters (e.g. 满 mǎn for 蛮 mán, as in 南满 → 南蛮).
Compositional decomposition — Split a compound character into its components, or fuse components into a single character, so the slur never appears in its usual form (e.g. 默 = 黑 + 犬).
Cross-lingual abbreviation — Use Pinyin initials, English / mixed-script abbreviations to encode a slur (e.g. txl → 同性恋).
Hybrid Chinese–English splicing — Fuse a Chinese morpheme with an English fragment to reconstruct a slur (e.g. ni + ger → ni哥).
Historical / cultural allusion — Re-purpose a region or historical name to demean a group (e.g. 南满 → 南蛮).
Conspiracy / meme reference — Invoke a memeified narrative as a coded slur (e.g. Kalergi → racial conspiracy meme).

Worked Examples

Term	Literal Meaning	Composition	Actual Meaning	Category
默(mò)	silence	黑(hēi) 犬(quǎn) → black dog	n*gger	racial
南(nán) 满(mǎn)	South Manchu	南满 → 南蛮(mán)	southern barbarians	regional
蠢驴	silly donkey	—	foolish people	general
txl	txl	txl → 同(tóng) 性(xìng) 恋(liàn)	gay	lgbtq
ni 哥(gē)	ni brother	ni + ger → n*gger	n*gger	racial
小(xiǎo) 仙(xiān) 女(nǚ)	fairy	—	shrew	sexual
凯(kǎi) 勒(lè) 奇(qí)	Kalergi	—	Kalergi Plan	racial

Sensitive English referents are masked (e.g. n*gger) for public display; full forms are kept inside the released lexicon files.
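
Treating each derived form as a first-class entry means a lookup can return the referent, pattern, and category along with the match position. The sketch below illustrates this with three of the rows from the table above. It assumes a simple in-memory list and exact substring matching; the `LexiconEntry` fields and both function names are hypothetical and do not describe the released lexicon files or the TKE model.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class LexiconEntry:
    surface: str    # form as it appears in a post
    referent: str   # what the form resolves to
    pattern: str    # derivation pattern (illustrative tag)
    category: str   # category column from the table


# Toy subset copied from the worked examples; a real lexicon is far larger.
LEXICON = [
    LexiconEntry("txl", "同性恋", "cross-lingual abbreviation", "lgbtq"),
    LexiconEntry("南满", "南蛮", "historical / cultural allusion", "regional"),
    LexiconEntry("凯勒奇", "Kalergi Plan", "conspiracy / meme reference", "racial"),
]


def find_entries(text: str,
                 lexicon: List[LexiconEntry] = LEXICON
                 ) -> List[Tuple[int, int, LexiconEntry]]:
    """Return (start, end, entry) for every exact occurrence of a surface form."""
    hits = []
    for entry in lexicon:
        start = text.find(entry.surface)
        while start != -1:
            hits.append((start, start + len(entry.surface), entry))
            start = text.find(entry.surface, start + 1)
    return sorted(hits, key=lambda h: (h[0], h[1]))


def char_categories(text: str) -> List[Optional[str]]:
    """Tag each character with the category of the entry covering it, else None."""
    flags: List[Optional[str]] = [None] * len(text)
    for start, end, entry in find_entries(text):
        for i in range(start, end):
            flags[i] = entry.category
    return flags


sample = "帖子里出现了txl和南满"
matches = [(s, e, entry.referent) for s, e, entry in find_entries(sample)]
```

Per-character tags like those from `char_categories` are one plausible way to hand lexicon knowledge to a downstream classifier. Exact matching will miss spacing or casing variants, which a practical matcher would need to normalise first.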

§ 3 ToxiCN_MM (NeurIPS 2024)

Paper: Towards Comprehensive Detection of Chinese Harmful Memes

Junyu Lu, Bo Xu, Xiaokun Zhang, Hongbo Wang, Haohao Zhu, Dongyu Zhang, Liang Yang, Hongfei Lin

Abstract

We introduce the definition of Chinese harmful memes: multimodal units consisting of an image and Chinese inline text that have the potential to cause harm to an individual, an organisation, a community, a social group, or society as a whole. These memes range from overt offence to subtle stereotypes, often reflecting and reinforcing underlying negative values on the Chinese Internet. To support research on detecting them, we construct ToxiCN_MM, a 12,000-sample dataset annotated along two axes: harmful type (targeted harmful, general offence, sexual innuendo, dispirited culture) and modality combination (text–image fusion, harmful text only, harmful image only, both). We further propose a Multimodal Knowledge Enhancement Detector that incorporates LLM-generated contextual information about meme content to better understand Chinese memes.

Harmful-Type Taxonomy

Targeted Harmful — Memes that attack a specific individual, group, or social category — the most common harmful type on Chinese platforms.
General Offence — Insults, profanity, or aggressive content not directed at a particular protected group.
Sexual Innuendo — Implicit or explicit sexual content delivered through visual metaphor, character substitution, or suggestive composition.
Dispirited Culture — Memes propagating nihilism, self-harm framing, or anti-aspirational narratives that erode social well-being.

Dataset

ToxiCN_MM is a Chinese harmful-meme dataset of 12,000 image–text pairs collected from public online sources. Memes are annotated for both harmful type and modality combination, enabling fine-grained study of where the harm arises (in the text alone, in the image alone, in both, or only once the two are fused). Version 2.0 (Dec. 2024) re-annotates the ambiguous samples (fewer than 1% of the dataset) and additionally releases the specific attacked targets for targeted harmful memes.

Total samples	12,000 image–text meme pairs
Harmful samples	annotated along harmful type and modality
Annotation axes	harmful type × modality combination
Harmful types	targeted harmful · general offence · sexual innuendo · dispirited culture
Language	Simplified Chinese
License	CC BY-NC-ND 4.0 (academic only)
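
Because the two annotation axes are independent, a cross-tabulation of harmful type against modality combination shows at a glance where harm tends to arise. The sketch below assumes a hypothetical `MemeAnnotation` record in which non-harmful memes carry neither label and each harmful meme has exactly one harmful type. The field names and the toy records are invented for illustration and are not the released schema or data.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class HarmfulType(Enum):
    TARGETED = "targeted harmful"
    OFFENCE = "general offence"
    SEXUAL = "sexual innuendo"
    DISPIRITED = "dispirited culture"


class Modality(Enum):
    FUSION = "text-image fusion"
    TEXT_ONLY = "harmful text only"
    IMAGE_ONLY = "harmful image only"
    BOTH = "both"


@dataclass(frozen=True)
class MemeAnnotation:
    """Hypothetical per-meme record; both labels are None for non-harmful memes."""
    meme_id: str
    harmful_type: Optional[HarmfulType] = None
    modality: Optional[Modality] = None


def cross_tab(records: Iterable[MemeAnnotation]) -> Counter:
    """Count harmful memes per (harmful type, modality combination) cell."""
    table: Counter = Counter()
    for rec in records:
        if rec.harmful_type is None and rec.modality is None:
            continue  # non-harmful meme, not part of the table
        if rec.harmful_type is None or rec.modality is None:
            raise ValueError(f"{rec.meme_id}: harmful memes need both labels")
        table[(rec.harmful_type, rec.modality)] += 1
    return table


# Invented toy records, only to show the shape of the result.
toy = [
    MemeAnnotation("m1", HarmfulType.TARGETED, Modality.FUSION),
    MemeAnnotation("m2", HarmfulType.TARGETED, Modality.FUSION),
    MemeAnnotation("m3", HarmfulType.SEXUAL, Modality.TEXT_ONLY),
    MemeAnnotation("m4"),
]
toy_table = cross_tab(toy)
```

Cells in the fusion row are the cases a text-only or image-only classifier cannot catch by construction, which is why the modality axis is useful for error analysis.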

§ 4 Ethics Statement

The ToxiCN family is released solely to support research on Chinese toxic-language and harmful-meme detection, content moderation, and the safety alignment of language and vision-language models. The resources must not be used to generate, amplify, or weaponise toxic content, nor to surveil or profile individuals. All samples were collected from publicly accessible platforms with personally identifiable information removed; annotators were briefed about the disturbing nature of the material, paid fairly, and could withdraw at any time.

Because the corpora inevitably reflect the user bases and moderation policies of their sources, models trained on them should not be deployed as the sole arbiter of moderation decisions. The opinions and findings contained in the samples should not be interpreted as representing the views of the authors. We acknowledge the risk of malicious actors attempting to reverse-engineer cloaked slurs or memes, and sincerely hope users will employ the datasets responsibly. Any content that infringes copyright or other intellectual-property rights will be removed upon request.

The datasets and example outputs contain offensive and potentially traumatising language and imagery — reader discretion is advised.

§ 5 Cited By (curated, grouped by direction)

Below we highlight a selection of representative follow-up works that build on the ToxiCN family, grouped by research direction. We are grateful to the many researchers whose attention and follow-up work have given the ToxiCN family its continued life.

Traditional Toxicity Detection

Standard toxic / hate / offensive classification on ToxiCN-style Chinese benchmarks, including LLM-based and knowledge-augmented systems.

Cloaked / Perturbed Toxicity Detection

Detecting indirect or adversarially obfuscated toxicity — homophones, abbreviations, character substitutions and dog-whistle references that evade surface-form classifiers.

Span-level Toxicity Detection

Token-/span-level localisation of the toxic component within a longer post, beyond document-level binary judgement.

STATE ToxiCN: A Benchmark for Span-level Target-Aware Toxicity Extraction in Chinese Hate Speech Detection · ACL 2025

Detoxification

Generating non-toxic rewrites of toxic Chinese text while preserving the original communicative intent.

Redefining Experts: Interpretable Decomposition of Language Models for Toxicity Mitigation · NeurIPS 2026
Overview of the Multilingual Text Detoxification Task at PAN 2025 · CEUR Workshop Proceedings 2025
Multilingual and Explainable Text Detoxification with Parallel Corpora · COLING 2025
Chinese Toxic Language Mitigation via Sentiment Polarity Consistent Rewrites · EMNLP 2025

The list is curated by the authors and grouped by research direction. Last update: 2026-05-29.

§ 6 Resources

§ 7 BibTeX

ToxiCN ACL 2023

@inproceedings{lu-etal-2023-facilitating,
    title = "Facilitating Fine-grained Detection of {C}hinese Toxic Language: Hierarchical Taxonomy, Resources, and Benchmarks",
    author = "Lu, Junyu  and  Xu, Bo  and  Zhang, Xiaokun  and  Min, Changrong  and  Yang, Liang  and  Lin, Hongfei",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.898",
    pages = "16235--16250"
}

ToxiCN_MM NeurIPS 2024

@inproceedings{lu2024towards,
    title     = {Towards Comprehensive Detection of Chinese Harmful Memes},
    author    = {Junyu Lu and Bo Xu and Xiaokun Zhang and Hongbo Wang and Haohao Zhu and Dongyu Zhang and Liang Yang and Hongfei Lin},
    booktitle = {The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
    year      = {2024},
    url       = {https://openreview.net/forum?id=PSDXcYjrkO}
}