Pipeline

class tmallet.TMallet(lang: Literal['en', 'de'] = 'en', prefer_gpu: bool = False, model_type: Literal['sm', 'md', 'lg', 'trf'] = 'lg')

A text obfuscation manager that applies transformations to text.

This class applies selected algorithmic obfuscation techniques (such as POS filtering, bag-of-words scrambling, or information-theoretic filtering) to strings, lists of text, or entire datasets.

Parameters:

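lang (LangConfig) – The language code, either “en” or “de”. Defaults to “en”.
prefer_gpu (bool) – Whether spaCy should use GPU acceleration when available. Defaults to False.
model_type (Literal['sm', 'md', 'lg', 'trf']) – The spaCy model variant to load (“sm”, “md”, “lg”, or “trf”). Defaults to “lg”.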

load_obfuscator(algorithm: Literal['pos-filter', 'scramble-hier', 'scramble-BoW', 'shannon', 'lemmatize'], config: dict[str, str])

Validates configuration and dynamically instantiates an obfuscation algorithm.

Parameters:

algorithm (str) – The identifier of the obfuscation technique: one of ‘pos-filter’, ‘scramble-hier’, ‘scramble-BoW’, ‘shannon’, or ‘lemmatize’.
config (dict[str, str]) – Key-value pairs holding the parameters for the chosen algorithm.

Returns:

The current class instance to allow for method chaining.

Return type:

TMallet
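
Because load_obfuscator returns the instance, loading and applying an obfuscator can be written as one chained expression. The following is a minimal sketch that assumes only the two methods documented here. The helper name obfuscate_once is hypothetical and not part of the library, and the config keys depend on the chosen algorithm and are not listed in this reference, so the caller supplies them:

```python
from typing import Any


def obfuscate_once(manager: Any, algorithm: str, config: dict[str, str], text: str):
    """Load an obfuscator and apply it to `text` in one chained call.

    `manager` is expected to be a TMallet instance, or any object exposing
    the same load_obfuscator/obfuscate methods. This helper is a
    hypothetical convenience wrapper, not part of the library.
    """
    # load_obfuscator returns the manager itself, so obfuscate can be chained.
    return manager.load_obfuscator(algorithm, config).obfuscate(text)
```

With a real instance this might be called as obfuscate_once(TMallet(lang='en'), 'lemmatize', config, text), where config holds whatever parameters the selected algorithm expects.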

obfuscate(text: str) → dict | str

Obfuscates standalone text strings or lists of strings.

Requires an obfuscator to be loaded via load_obfuscator prior to invocation.

Parameters:

text (Union[List[str], str]) – A single text or a collection of texts to process.

Raises:

RuntimeError – If an obfuscator and its configuration have not been loaded yet.

Returns:

The obfuscated text or collection of texts, as a string or a dictionary.

Return type:

dict | str
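
Since obfuscate raises RuntimeError when nothing has been loaded, callers that cannot guarantee the call order may want to guard the call. The following is a sketch of one way to do that. The helper try_obfuscate is hypothetical, and it relies only on the documented RuntimeError behaviour:

```python
from typing import Any, Optional


def try_obfuscate(manager: Any, text: str) -> Optional[Any]:
    """Return the obfuscated text, or None if obfuscate raises RuntimeError.

    `manager` is expected to be a TMallet instance. This helper is
    hypothetical and not part of the library.
    """
    try:
        return manager.obfuscate(text)
    except RuntimeError:
        # Documented failure mode: load_obfuscator has not been called yet.
        return None
```

Note that this catches every RuntimeError, including ones unrelated to a missing obfuscator, so in production code it may be safer to call load_obfuscator first and let the exception propagate.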

obfuscate_dataset(dataset: Dataset, column: str, column_obfuscated: str, batch_size: int = 10, num_proc: int | None = None)

Applies the loaded obfuscator to an entire Hugging Face (or compatible) dataset, batch by batch.

Parameters:

dataset (Dataset) – The underlying dataset collection containing columns of data.
column (str) – Key of the column containing raw target text.
column_obfuscated (str) – Base name of the column in which the output is stored.
batch_size (int, optional) – Number of examples processed together in one batch. Defaults to 10.
num_proc (Optional[int], optional) – Number of processes used for parallel processing. Defaults to None.

Returns:

A copy of the dataset with the obfuscated column(s) added.

Return type:

Dataset
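
The roles of column, column_obfuscated, and batch_size can be illustrated with a plain-Python sketch. This is not TMallet's implementation: it uses lists of dicts instead of a Dataset, writes to a single output column, and the names map_in_batches and reverse_words are hypothetical stand-ins for the real batched mapping and obfuscator:

```python
from typing import Callable


def map_in_batches(
    rows: list[dict],
    column: str,
    column_obfuscated: str,
    obfuscate_batch: Callable[[list[str]], list[str]],
    batch_size: int = 10,
) -> list[dict]:
    """Return copies of `rows` with an added `column_obfuscated` field."""
    out: list[dict] = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # The transformation sees one batch of raw texts at a time.
        results = obfuscate_batch([row[column] for row in batch])
        for row, result in zip(batch, results):
            # Build a new dict so the input rows stay unmodified.
            out.append({**row, column_obfuscated: result})
    return out


def reverse_words(texts: list[str]) -> list[str]:
    # Placeholder transformation for demonstration only, not a real obfuscator.
    return [" ".join(reversed(t.split())) for t in texts]


rows = [{"text": "a b c"}, {"text": "hello world"}, {"text": "x"}]
result = map_in_batches(rows, "text", "text_obf", reverse_words, batch_size=2)
```

Larger batches mean fewer calls into the transformation at the cost of more memory per call, which is the usual trade-off when tuning batch_size.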

obfuscate_dataset_by_chunk(dataset_repo: str, column: str, column_obfuscated: str, save_chunks_to_folder: Path, dataset_config: str | None = None, dataset_split: str = 'train', chunk_size: int = 5000, batch_size: int = 100, num_proc: int | None = None, start_index: int = 0, num_samples: int | None = None) → Dataset

Loads a dataset from the Hub in chunks (non-streaming), obfuscates each chunk, and saves checkpoints to disk for fault tolerance.

Parameters:

dataset_repo (str) – HuggingFace Hub repo ID or local path for load_dataset.
column (str) – Key of the column containing raw target text.
column_obfuscated (str) – Base name of the column in which the output is stored.
save_chunks_to_folder (Path) – Directory to save/load disk checkpoints.
dataset_config (Optional[str]) – Dataset config/subset name passed to load_dataset.
dataset_split (str) – Split to load (e.g. “train”, “validation”). Defaults to “train”.
chunk_size (int) – Number of examples per chunk. Defaults to 5000.
batch_size (int) – Inner batch size passed to .map. Defaults to 100.
num_proc (Optional[int]) – CPU parallelism for .map. Defaults to None.
start_index (int) – Index of the first example to process. Defaults to 0.
num_samples (Optional[int]) – Number of examples to process from start_index. Defaults to None (process until end of dataset).

Returns:

Concatenated dataset of all processed chunks.

Return type:

Dataset
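
How start_index, num_samples, and chunk_size might interact with on-disk checkpoints can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: the function names, the JSON file format, and the chunk_{start}_{end}.json naming scheme are all hypothetical, and plain lists stand in for dataset chunks:

```python
import json
from pathlib import Path
from typing import Callable, Optional


def chunk_bounds(
    total: int,
    chunk_size: int = 5000,
    start_index: int = 0,
    num_samples: Optional[int] = None,
) -> list[tuple[int, int]]:
    """Half-open (start, end) index ranges covering the requested window."""
    end = total if num_samples is None else min(total, start_index + num_samples)
    return [(s, min(s + chunk_size, end)) for s in range(start_index, end, chunk_size)]


def process_with_checkpoints(
    items: list[str],
    folder: Path,
    transform: Callable[[str], str],
    chunk_size: int = 5000,
) -> list[str]:
    """Process `items` chunk by chunk, reusing chunk files already on disk."""
    folder.mkdir(parents=True, exist_ok=True)
    merged: list[str] = []
    for start, end in chunk_bounds(len(items), chunk_size):
        path = folder / f"chunk_{start}_{end}.json"  # hypothetical naming scheme
        if path.exists():
            # A previous run already finished this chunk: load it, skip the work.
            chunk = json.loads(path.read_text())
        else:
            chunk = [transform(x) for x in items[start:end]]
            path.write_text(json.dumps(chunk))
        merged.extend(chunk)
    return merged
```

If a run is interrupted, calling process_with_checkpoints again with the same folder recomputes only the chunks that have no file yet, which is the fault-tolerance property the description above refers to.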