Text Mallet Documentation

text-mallet is a toolkit for transforming text into obfuscated or derived formats while preserving utility for downstream NLP tasks such as classification, retrieval, and topic modeling.

The package focuses on reducing the risk of privacy violations or copyright infringement by degrading the information needed to reconstruct the original text, while retaining task-relevant signals.

Contents

API

Project Info

Examples

Algorithm: Part-of-Speech Filtering
Original
Data obfuscation is the process of modifying sensitive data in such a way that it is of no or little value to unauthorized intruders while still being usable by software or authorized personnel. Data masking can also be referred as anonymization, or tokenization, depending on different context.
↓  pos-filter  ↓
Obfuscated
Data obfuscation AUX DET process ADP VERB ADJ data ADP DET DET way SCONJ PRON AUX ADP DET CCONJ ADJ value ADP ADJ intruders SCONJ ADV AUX ADJ ADP software CCONJ ADJ personnel PUNCT Data masking AUX ADV AUX VERB ADP anonymization PUNCT CCONJ tokenization PUNCT VERB ADP ADJ context PUNCT
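The transformation above can be approximated with a short sketch. It assumes the tokens have already been tagged with Universal POS labels by some external tagger, and it assumes that only NOUN and PROPN tokens are kept, which is inferred from the sample output rather than confirmed. The function name `pos_filter` and its `keep` parameter are hypothetical and are not part of the text-mallet API.

```python
from typing import FrozenSet, Iterable, List, Tuple

def pos_filter(
    tagged: Iterable[Tuple[str, str]],
    keep: FrozenSet[str] = frozenset({"NOUN", "PROPN"}),  # assumed keep-set
) -> List[str]:
    """Keep tokens whose tag is in `keep`; replace all others by their tag."""
    return [token if tag in keep else tag for token, tag in tagged]

# Hand-tagged opening of the sample sentence (tags copied from the output above).
tagged = [
    ("Data", "NOUN"), ("obfuscation", "NOUN"), ("is", "AUX"), ("the", "DET"),
    ("process", "NOUN"), ("of", "ADP"), ("modifying", "VERB"),
    ("sensitive", "ADJ"), ("data", "NOUN"),
]
result = " ".join(pos_filter(tagged))
print(result)  # Data obfuscation AUX DET process ADP VERB ADJ data
```

Passing a different `keep` set (for example, adding "VERB") would trade readability against how much lexical content survives.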
Algorithm: Mutual Information Filter
Original
Data obfuscation is the process of modifying sensitive data in such a way that it is of no or little value to unauthorized intruders while still being usable by software or authorized personnel. Data masking can also be referred as anonymization, or tokenization, depending on different context.
↓  shannon  ↓
Obfuscated
Data obfuscation _ _ process _ modifying sensitive data _ such _ _ _ _ _ _ _ _ value _ unauthorized intruders while _ _ usable _ _ _ authorized personnel _ Data masking _ _ _ _ anonymization _ _ tokenization _ depending _ _ context _
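The idea of masking low-information tokens can be illustrated with a simplified stand-in. The sketch below does not compute mutual information; it uses unigram surprisal, -log2 p(w), estimated from a toy reference corpus, and masks every token whose surprisal falls below a threshold. The function names, the threshold, and the reference corpus are all assumptions made for illustration and do not describe how the shannon algorithm is implemented.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List

def surprisal_table(reference_tokens: Iterable[str]) -> Dict[str, float]:
    """Map each lowercased word to -log2 of its relative frequency."""
    counts = Counter(token.lower() for token in reference_tokens)
    total = sum(counts.values())
    return {word: -math.log2(count / total) for word, count in counts.items()}

def mask_low_information(
    tokens: Iterable[str],
    table: Dict[str, float],
    threshold: float,
    mask: str = "_",
) -> List[str]:
    """Replace tokens whose surprisal is below `threshold` with `mask`.

    Words missing from the table are treated as maximally informative and kept.
    """
    return [
        token if table.get(token.lower(), math.inf) >= threshold else mask
        for token in tokens
    ]

# Toy reference corpus: frequent words get low surprisal.
reference = "the cat sat on the mat and the dog sat on the rug".split()
table = surprisal_table(reference)
masked = mask_low_information("the dog sat on the mat".split(), table, threshold=3.0)
print(" ".join(masked))  # _ dog _ _ _ mat
```

In practice the frequency table would come from a large corpus, and the threshold controls how aggressively function words and other common tokens are removed.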
Algorithm: Hierarchical Scrambling
Original
Data obfuscation is the process of modifying sensitive data in such a way that it is of no or little value to unauthorized intruders while still being usable by software or authorized personnel. Data masking can also be referred as anonymization, or tokenization, depending on different context.
↓  scramble-hier  ↓
Obfuscated
in or software obfuscation that such it value Data to data personnel little. of while the no is way by process usable still a sensitive unauthorized authorized being modifying intruders , different tokenization can Data or masking also. anonymization as context depending referred be on
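In the sample output, the words of the first sentence still precede those of the second, which suggests that tokens are shuffled inside each sentence while sentence order is preserved. The sketch below implements that reading under stated assumptions: sentences are split with a naive regular expression, tokens are whitespace-separated, and the name `scramble_within_sentences` is hypothetical. The actual scramble-hier algorithm may use a different hierarchy or tokenizer.

```python
import random
import re
from typing import List

def scramble_within_sentences(text: str, seed: int = 0) -> List[str]:
    """Shuffle tokens inside each sentence; keep the sentences in order."""
    rng = random.Random(seed)  # seeded for reproducible output
    # Naive split: break after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    scrambled = []
    for sentence in sentences:
        tokens = sentence.split()
        rng.shuffle(tokens)
        scrambled.append(" ".join(tokens))
    return scrambled

text = "Data masking hides values. It keeps formats intact."
for sentence in scramble_within_sentences(text, seed=1):
    print(sentence)
```

Because shuffling never crosses a sentence boundary, sentence-level features such as per-sentence word counts survive, while word order inside each sentence is lost.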
Algorithm: Bag of Words
Original
Data obfuscation is the process of modifying sensitive data in such a way that it is of no or little value to unauthorized intruders while still being usable by software or authorized personnel. Data masking can also be referred as anonymization, or tokenization, depending on different context.
↓  scramble-BoW  ↓
Obfuscated
intruders unauthorized Data referred modifying usable as being of or Data is process software it still little sensitive be that value context. different or depending masking a no tokenization, authorized of to or personnel. anonymization, the can data while by is obfuscation on way such also in
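A bag-of-words scramble can be sketched as a single shuffle over all tokens in the document. The version below assumes whitespace tokenization (so punctuation stays attached to words, as in the sample output) and uses a hypothetical function name, `scramble_bag_of_words`. It is an illustration, not the package's implementation.

```python
import random
from collections import Counter

def scramble_bag_of_words(text: str, seed: int = 0) -> str:
    """Shuffle every whitespace-separated token across the whole text."""
    tokens = text.split()
    random.Random(seed).shuffle(tokens)  # seeded for reproducible output
    return " ".join(tokens)

text = "Data masking can also be referred as anonymization, or tokenization."
scrambled = scramble_bag_of_words(text, seed=42)

# Order is destroyed, but the multiset of tokens is unchanged.
same_bag = Counter(scrambled.split()) == Counter(text.split())
print(scrambled)
print(same_bag)  # True
```

Unlike the sentence-level variant, this shuffle also discards sentence boundaries, so only document-level token counts remain.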

Overview

Text can be transformed along multiple linguistic dimensions:

  • Word Forms (surface character sequences)

  • Syntactic and Morphological Features

  • Semantic Content

  • Grammatical Relations

  • Sequence Structure

Each of these contributes information to the final text. text-mallet provides mechanisms to selectively erode this information, producing representations that are less human-readable but still useful for machine learning tasks.

Languages rely on these dimensions to different degrees. For example, English depends heavily on word order, while German relies more on morphological variation.

Why Obfuscate Text?

Many NLP tasks do not require fully reconstructable text. Tasks such as:

  • Text classification

  • Semantic similarity

  • Topic modeling

  • Information retrieval

can often operate effectively on degraded or transformed inputs.
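A small example shows why this can hold for count-based methods. The sketch below compares a query against a document and against a word-order-scrambled copy of it, using cosine similarity over raw term counts. The texts and the `cosine` helper are made up for illustration; models that depend on word order would not be invariant in this way.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

document = "data masking hides sensitive data from intruders"
scrambled = "intruders data hides from masking data sensitive"  # same words, new order
query = "masking sensitive data"

original_score = cosine(Counter(document.split()), Counter(query.split()))
scrambled_score = cosine(Counter(scrambled.split()), Counter(query.split()))

# Term counts ignore word order, so both scores are identical.
print(original_score == scrambled_score)  # True
```

Retrieval or classification built on such count features sees the scrambled text exactly as it sees the original, even though a human reader no longer can.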

This package enables:

  • Possible use of sensitive or copyrighted data without exposing raw text

  • Reduced risk of reconstruction through adversarial attacks (e.g. embedding inversion) or through model outputs

Obfuscated text is intended to complement existing datasets, not to replace clean data.