Pipeline

class tmallet.TMallet(lang: Literal['en', 'de'] = 'en', prefer_gpu: bool = False)[source]

A text obfuscation manager that applies transformations to text.

This class applies selected algorithmic obfuscation techniques (such as POS filtering, bag-of-words scrambling, or information-theoretic filtering) to strings, lists of text, or entire datasets.
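As a rough illustration of what a bag-of-words scramble does, the sketch below shuffles whitespace-separated tokens so that word order is lost while the set of words is kept. This is not TMallet's implementation: the function name `scramble_bag_of_words`, its `seed` parameter, and the whitespace tokenization are all assumptions made for the example.

```python
import random


def scramble_bag_of_words(text: str, seed: int = 0) -> str:
    """Shuffle whitespace-separated tokens: same words, different order.

    Hypothetical helper for illustration only, not part of tmallet.
    """
    tokens = text.split()
    rng = random.Random(seed)  # local RNG so results are reproducible per seed
    rng.shuffle(tokens)
    return " ".join(tokens)


original = "the quick brown fox jumps over the lazy dog"
scrambled = scramble_bag_of_words(original, seed=42)
print(scrambled)
```

The multiset of tokens is unchanged, so word-frequency information survives while syntax does not.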

Parameters:
  • lang (LangConfig) – The language code. The signature accepts “en” or “de”. Defaults to “en”.

  • prefer_gpu (bool) – Whether spaCy should be configured to use GPU acceleration. Defaults to False.

load_obfuscator(algorithm: Literal['pos-filter', 'scramble-hier', 'scramble-BoW', 'shannon', 'lemmatize'], config: dict[str, str])[source]

Validates configuration and dynamically instantiates an obfuscation algorithm.

Parameters:
  • algorithm (str) – The identifier of the obfuscation technique (e.g., ‘pos-filter’).

  • config (Dict) – Key-value pairs holding the parameters for the selected algorithm.

Returns:

The current instance, which allows method chaining.

Return type:

TMallet

obfuscate(text: str) → dict | str[source]

Obfuscates standalone text strings or lists of strings.

Requires an obfuscator to be loaded via load_obfuscator prior to invocation.

Parameters:

text (Union[List[str], str]) – Single text payload or collection of texts to process.

Raises:

RuntimeError – If an obfuscator and configuration have not been loaded yet.

Returns:

The obfuscated text or collection of texts, returned as a dictionary.

Return type:

Dict
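The load-then-obfuscate contract described above (load_obfuscator returns the instance for chaining, and obfuscate raises RuntimeError when nothing is loaded) can be sketched with a small stand-in class. Everything here is hypothetical: `ObfuscationPipeline`, the toy `"upper"` algorithm, and the error messages are not part of tmallet, and the real class may validate configuration differently.

```python
from typing import Callable, Optional


class ObfuscationPipeline:
    """Hypothetical stand-in for illustration; not the TMallet class."""

    def __init__(self) -> None:
        self._obfuscator: Optional[Callable[[str], str]] = None

    def load_obfuscator(self, algorithm: str, config: dict) -> "ObfuscationPipeline":
        # Toy registry with a single made-up algorithm; config is ignored here.
        registry = {"upper": lambda text: text.upper()}
        if algorithm not in registry:
            raise ValueError(f"Unknown algorithm: {algorithm!r}")
        self._obfuscator = registry[algorithm]
        return self  # returning self is what makes chaining possible

    def obfuscate(self, text: str) -> str:
        if self._obfuscator is None:
            raise RuntimeError("Call load_obfuscator() before obfuscate().")
        return self._obfuscator(text)


# Chained call: load, then obfuscate in one expression.
result = ObfuscationPipeline().load_obfuscator("upper", config={}).obfuscate("hello")
print(result)

# Calling obfuscate() first triggers the guard.
try:
    ObfuscationPipeline().obfuscate("hello")
except RuntimeError as exc:
    print(f"RuntimeError: {exc}")
```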

obfuscate_dataset(dataset: Dataset, column: str, column_obfuscated: str, batch_size: int = 10, num_proc: int | None = None)[source]

Maps the loaded obfuscator sequentially across an entire Hugging Face (or compatible) Dataset object.

Parameters:
  • dataset (Dataset) – The dataset containing the text column to obfuscate.

  • column (str) – Key of the column containing raw target text.

  • column_obfuscated (str) – Target base column key for saving the output.

  • batch_size (int, optional) – Number of examples processed together in one batch. Defaults to 10.

  • num_proc (Optional[int], optional) – Number of processes to use for parallel processing. Defaults to None.

Returns:

A new copy of the dataset with the obfuscated column(s) added.

Return type:

Dataset
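To make the roles of column, column_obfuscated, and batch_size concrete, here is a minimal pure-Python sketch of a batched column mapping over a list of row dictionaries. It is only a stand-in for a dataset-level map: `map_in_batches`, `reverse_all`, and the `text_obf` column name are hypothetical, and whether obfuscate_dataset works exactly this way internally is an assumption.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def map_in_batches(
    rows: List[Row],
    column: str,
    column_obfuscated: str,
    transform: Callable[[List[str]], List[str]],
    batch_size: int = 10,
) -> List[Row]:
    """Return new rows with `column_obfuscated` added; the input is not mutated."""
    out: List[Row] = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # The transform sees one batch of raw texts at a time.
        obfuscated = transform([row[column] for row in batch])
        for row, value in zip(batch, obfuscated):
            out.append({**row, column_obfuscated: value})
    return out


def reverse_all(texts: List[str]) -> List[str]:
    """Placeholder transform standing in for a real obfuscator."""
    return [text[::-1] for text in texts]


rows = [{"text": f"sample {i}"} for i in range(25)]
mapped = map_in_batches(rows, "text", "text_obf", reverse_all, batch_size=10)
print(len(mapped), mapped[0])
```

Returning copies rather than editing rows in place mirrors the documented behavior of returning an updated copy of the dataset.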

obfuscate_dataset_by_chunk(dataset_repo: str, column: str, column_obfuscated: str, save_chunks_to_folder: Path, dataset_config: str | None = None, dataset_split: str = 'train', chunk_size: int = 5000, batch_size: int = 100, num_proc: int | None = None, start_index: int = 0, num_samples: int | None = None) → Dataset[source]

Streams a dataset from the Hub in chunks, obfuscates each chunk, and saves checkpoints to disk for fault tolerance.

Parameters:
  • dataset_repo (str) – HuggingFace Hub repo ID or local path for load_dataset.

  • column (str) – Key of the column containing raw target text.

  • column_obfuscated (str) – Target base column key for saving the output.

  • save_chunks_to_folder (Path) – Directory to save/load disk checkpoints.

  • dataset_config (Optional[str]) – Dataset config/subset name passed to load_dataset.

  • dataset_split (str) – Split to stream (e.g. “train”, “validation”). Defaults to “train”.

  • chunk_size (int) – Number of examples per chunk. Defaults to 5_000.

  • batch_size (int) – Inner batch size passed to .map. Defaults to 100.

  • num_proc (Optional[int]) – CPU parallelism for .map. Defaults to None.

  • start_index (int) – Index of the first example to process. Defaults to 0.

  • num_samples (Optional[int]) – Number of examples to process from start_index. Defaults to None (process until stream is exhausted).

Returns:

Concatenated dataset of all processed chunks.

Return type:

Dataset
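The chunk-and-checkpoint pattern behind this method can be sketched in plain Python: take a window of the stream (start_index, num_samples), split it into chunks, and reuse a chunk's checkpoint file when one already exists. This is a sketch under stated assumptions, not the library's code: `process_by_chunk`, `iter_chunks`, the JSON checkpoint format, and the file naming scheme are all hypothetical.

```python
import json
import tempfile
from itertools import islice
from pathlib import Path
from typing import Callable, Dict, Iterable, Iterator, List, Optional

Row = Dict[str, str]


def iter_chunks(stream: Iterable[Row], chunk_size: int) -> Iterator[List[Row]]:
    """Yield consecutive lists of up to `chunk_size` rows from a stream."""
    iterator = iter(stream)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def process_by_chunk(
    stream: Iterable[Row],
    process: Callable[[List[Row]], List[Row]],
    folder: Path,
    chunk_size: int = 5000,
    start_index: int = 0,
    num_samples: Optional[int] = None,
) -> List[Row]:
    folder.mkdir(parents=True, exist_ok=True)
    stop = None if num_samples is None else start_index + num_samples
    window = islice(stream, start_index, stop)  # skip, then cap the stream
    results: List[Row] = []
    for i, chunk in enumerate(iter_chunks(window, chunk_size)):
        # Hypothetical naming: absolute offset of the chunk's first row.
        # Checkpoints are only reusable with the same start_index and chunk_size.
        checkpoint = folder / f"chunk_{start_index + i * chunk_size:08d}.json"
        if checkpoint.exists():
            processed = json.loads(checkpoint.read_text(encoding="utf-8"))
        else:
            processed = process(chunk)
            checkpoint.write_text(json.dumps(processed), encoding="utf-8")
        results.extend(processed)
    return results


calls: List[int] = []


def upper_chunk(chunk: List[Row]) -> List[Row]:
    """Placeholder for obfuscation; records how many rows it was given."""
    calls.append(len(chunk))
    return [{**row, "text_obf": row["text"].upper()} for row in chunk]


with tempfile.TemporaryDirectory() as tmp:
    make_stream = lambda: ({"text": f"row {i}"} for i in range(12))
    first = process_by_chunk(make_stream(), upper_chunk, Path(tmp), 5, 2, 8)
    # Second run finds both checkpoints and does no new processing.
    second = process_by_chunk(make_stream(), upper_chunk, Path(tmp), 5, 2, 8)

print(len(first), calls)
```

In this sketch a rerun still reads the stream but skips the expensive processing step for any chunk that already has a checkpoint, which is the fault-tolerance benefit the method description refers to.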