Text Mallet Documentation

text-mallet is a toolkit for transforming text into obfuscated or derived formats while preserving utility for downstream NLP tasks such as classification, retrieval, and topic modeling.

The package focuses on reducing the risk of privacy violations or copyright infringement by degrading the information needed to reconstruct the original text, while retaining task-relevant signals.

Contents

Getting Started

Obfuscation Methods

API

Pipeline

Project Info

About

Examples

Algorithm: Part-of-Speech Filtering

Original

Data obfuscation is the process of modifying sensitive data in such a way that it is of no or little value to unauthorized intruders while still being usable by software or authorized personnel. Data masking can also be referred as anonymization, or tokenization, depending on different context.

↓ pos-filter ↓

Obfuscated

Data obfuscation AUX DET process ADP VERB ADJ data ADP DET DET way SCONJ PRON AUX ADP DET CCONJ ADJ value ADP ADJ intruders SCONJ ADV AUX ADJ ADP software CCONJ ADJ personnel PUNCT Data masking AUX ADV AUX VERB ADP anonymization PUNCT CCONJ tokenization PUNCT VERB ADP ADJ context PUNCT
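The substitution pattern in the output above can be sketched in a few lines of Python. This is an illustration only, not text-mallet's API: the function name `pos_filter`, the `KEEP_TAGS` set, and the pre-tagged input are all assumptions. Judging from the example, noun-like tokens keep their surface form and every other token is replaced by its part-of-speech tag; a real pipeline would obtain the tags from a tagger.

```python
# Hypothetical helper (not part of text-mallet): tokens arrive already tagged.
# Assumption inferred from the example output: only noun-like tokens are kept.
KEEP_TAGS = {"NOUN", "PROPN"}


def pos_filter(tagged_tokens, keep_tags=KEEP_TAGS):
    """Replace every token whose tag is not in keep_tags with the tag itself."""
    return " ".join(
        word if tag in keep_tags else tag
        for word, tag in tagged_tokens
    )


# Hand-tagged opening of the example sentence (tags copied from the output above).
tagged = [
    ("Data", "NOUN"), ("obfuscation", "NOUN"), ("is", "AUX"),
    ("the", "DET"), ("process", "NOUN"), ("of", "ADP"),
    ("modifying", "VERB"), ("sensitive", "ADJ"), ("data", "NOUN"),
]

result = pos_filter(tagged)
print(result)  # Data obfuscation AUX DET process ADP VERB ADJ data
```

Changing `keep_tags` controls how much surface vocabulary survives: a larger set preserves more lexical signal at the cost of easier reconstruction.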

Algorithm: Mutual Information Filter

Original (same text as in the Part-of-Speech Filtering example)

↓ shannon ↓

Obfuscated

Data obfuscation _ _ process _ modifying sensitive data _ such _ _ _ _ _ _ _ _ value _ unauthorized intruders while _ _ usable _ _ _ authorized personnel _ Data masking _ _ _ _ anonymization _ _ tokenization _ depending _ _ context _
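The masking behaviour above can be approximated with a simple information-theoretic filter. The sketch below is not the package's scoring method, which is not described here. As a stand-in it uses unigram self-information (surprisal), -log2 p(w), and masks tokens that fall below a threshold. The function name, the threshold, and the toy reference counts are all hypothetical.

```python
import math
from collections import Counter


def mask_low_information(tokens, reference_counts, threshold_bits, mask="_"):
    """Mask tokens whose unigram surprisal -log2 p(w) is below threshold_bits."""
    total = sum(reference_counts.values())
    masked = []
    for token in tokens:
        count = reference_counts.get(token.lower(), 0)
        # Tokens unseen in the reference counts are treated as maximally
        # informative, so they are always kept.
        surprisal = math.inf if count == 0 else -math.log2(count / total)
        masked.append(token if surprisal >= threshold_bits else mask)
    return masked


# Toy reference counts, made up for illustration (they sum to 100).
reference = Counter({"the": 50, "of": 30, "is": 15, "process": 3, "data": 2})

tokens = "Data obfuscation is the process of modifying sensitive data".split()
masked = mask_low_information(tokens, reference, threshold_bits=3.0)
print(" ".join(masked))  # Data obfuscation _ _ process _ modifying sensitive data
```

Frequent function words carry little information under this measure and are masked first, while rare content words survive. A mutual-information criterion would instead score tokens against a target such as a class label.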

Algorithm: Hierarchical Scrambling

Original (same text as in the Part-of-Speech Filtering example)

↓ scramble-hier ↓

Obfuscated

in or software obfuscation that such it value Data to data personnel little. of while the no is way by process usable still a sensitive unauthorized authorized being modifying intruders , different tokenization can Data or masking also. anonymization as context depending referred be on
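In the output above, words appear to be shuffled only within their own sentence while the sentences stay in order. A minimal sketch of that idea follows, assuming the text has already been split into sentences. The function name and the seed parameter are hypothetical, and the package's actual hierarchy levels may differ.

```python
import random


def scramble_within_sentences(sentences, seed=0):
    """Shuffle the words of each sentence independently, keeping sentence order."""
    rng = random.Random(seed)  # seeded so the result is reproducible
    scrambled = []
    for sentence in sentences:
        words = sentence.split()
        rng.shuffle(words)
        scrambled.append(" ".join(words))
    return scrambled


sentences = [
    "Data obfuscation is the process of modifying sensitive data",
    "Data masking can also be referred to as anonymization",
]

scrambled = scramble_within_sentences(sentences, seed=42)
for line in scrambled:
    print(line)
```

Because no word crosses a sentence boundary, sentence-level statistics such as length and per-sentence vocabulary are preserved even though local word order is destroyed.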

Algorithm: Bag of Words

Original (same text as in the Part-of-Speech Filtering example)

↓ scramble-BoW ↓

Obfuscated

intruders unauthorized Data referred modifying usable as being of or Data is process software it still little sensitive be that value context. different or depending masking a no tokenization, authorized of to or personnel. anonymization, the can data while by is obfuscation on way such also in
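By contrast, the bag-of-words output above mixes words across sentence boundaries. The sketch below is an illustration with hypothetical names, not the package's implementation. It shuffles all whitespace-separated tokens of the document at once and checks that the token counts, which are what a bag-of-words model consumes, are unchanged.

```python
import random
from collections import Counter


def scramble_bag_of_words(text, seed=0):
    """Shuffle every whitespace-separated token of the text in one pass."""
    rng = random.Random(seed)  # seeded so the result is reproducible
    words = text.split()
    rng.shuffle(words)
    return " ".join(words)


original = (
    "Data obfuscation is the process of modifying sensitive data. "
    "Data masking can also be referred to as anonymization."
)

shuffled = scramble_bag_of_words(original, seed=42)
print(shuffled)

# Word order is gone, but the multiset of tokens is identical, so any
# count-based feature extractor sees the same input as before.
same_counts = Counter(original.split()) == Counter(shuffled.split())
print(same_counts)  # True
```

This is why count-based classifiers and topic models are largely unaffected by this transformation, while sequence models lose the ordering information they rely on.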

Overview

Text can be transformed along multiple linguistic dimensions:

Word Forms (surface character sequences)
Syntactic and Morphological Features
Semantic Content
Grammatical Relations
Sequence Structure

Each of these contributes information to the final text. text-mallet provides mechanisms to selectively erode this information, producing representations that are less human-readable but still useful for machine learning tasks.

Different languages rely on these dimensions differently. For example, English depends heavily on word order, while German relies more on morphological variation.

Why Obfuscate Text?

Many NLP tasks do not require fully reconstructable text. Tasks such as:

Text classification
Semantic similarity
Topic modeling
Information retrieval

can often operate effectively on degraded or transformed inputs.

This package enables:

Possible use of sensitive or copyrighted data without exposing raw text
Reduced risk of reconstruction through adversarial attacks (e.g. embedding inversion) or through model outputs

Obfuscated text is intended to complement existing datasets rather than replace clean data.