Pipeline
- class tmallet.TMallet(lang: Literal['en', 'de'] = 'en', prefer_gpu: bool = False)
A text obfuscation manager that applies transformations to text.
This class applies selected algorithmic obfuscation techniques (such as POS filtering, bag-of-words scrambling, or information-theoretic filtering) to strings, lists of text, or entire datasets.
- Parameters:
lang (LangConfig) – The language configuration code, either “en” or “de”. Defaults to “en”.
prefer_gpu (bool) – Whether spaCy should prefer GPU acceleration. Defaults to False.
- load_obfuscator(algorithm: Literal['pos-filter', 'scramble-hier', 'scramble-BoW', 'shannon', 'lemmatize'], config: dict[str, str])
Validates configuration and dynamically instantiates an obfuscation algorithm.
- Parameters:
algorithm (str) – The identifier of the obfuscation technique: one of ‘pos-filter’, ‘scramble-hier’, ‘scramble-BoW’, ‘shannon’, or ‘lemmatize’.
config (Dict) – Key-value pairs containing the parameters for the selected algorithm.
- Returns:
The current class instance to allow for method chaining.
- Return type:
TMallet
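The validate-then-instantiate step and the chaining return value can be illustrated with a small stand-in. This is a minimal sketch, not TMallet's implementation: `ToyPipeline`, `REGISTRY`, the `toy-reverse` and `toy-mask` identifiers, and the `mask_token` config key are all hypothetical names invented for the example.

```python
from typing import Callable

# Toy factories (hypothetical): each takes a config dict and returns a str -> str function.
def _make_reverse(config: dict[str, str]) -> Callable[[str], str]:
    # Reverses word order; ignores the config.
    return lambda text: " ".join(reversed(text.split()))

def _make_mask(config: dict[str, str]) -> Callable[[str], str]:
    # Replaces every word with the configured mask token.
    token = config["mask_token"]
    return lambda text: " ".join(token for _ in text.split())

# Registry (hypothetical): algorithm id -> (factory, required config keys).
REGISTRY = {
    "toy-reverse": (_make_reverse, set()),
    "toy-mask": (_make_mask, {"mask_token"}),
}

class ToyPipeline:
    """Stand-in class for illustration only; not the TMallet class."""

    def __init__(self) -> None:
        self.obfuscator = None

    def load_obfuscator(self, algorithm: str, config: dict[str, str]) -> "ToyPipeline":
        # Validate the algorithm id and the config before building anything.
        if algorithm not in REGISTRY:
            raise ValueError(f"unknown algorithm: {algorithm!r}")
        factory, required = REGISTRY[algorithm]
        missing = required - set(config)
        if missing:
            raise ValueError(f"missing config keys: {sorted(missing)}")
        # Instantiate dynamically from the registry entry.
        self.obfuscator = factory(config)
        return self  # returning the instance is what allows chaining

# Construction and loading chained in one expression.
pipeline = ToyPipeline().load_obfuscator("toy-mask", {"mask_token": "[X]"})
masked = pipeline.obfuscator("the quick fox")
```

Returning the instance from the loader lets construction and configuration be written as a single expression, which is the behaviour the return value described above makes possible.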
- obfuscate(text: str) → dict | str
Obfuscates standalone text strings or lists of strings.
Requires an obfuscator to be loaded via load_obfuscator prior to invocation.
- Parameters:
text (Union[List[str], str]) – Single text payload or collection of texts to process.
- Raises:
RuntimeError – If an obfuscator and configuration have not been loaded yet.
- Returns:
The obfuscated text or collection of texts, returned as a dictionary.
- Return type:
Dict
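The load-before-use requirement can be sketched as a simple guard. This is an illustrative stand-in under stated assumptions: `ToyPipeline` is hypothetical, its `load_obfuscator` takes a plain function rather than an algorithm name, and it returns a string or list rather than the dictionary described above.

```python
from typing import Callable, Optional, Union

class ToyPipeline:
    """Hypothetical stand-in; not the TMallet implementation."""

    def __init__(self) -> None:
        # No obfuscator is available until load_obfuscator() is called.
        self._obfuscator: Optional[Callable[[str], str]] = None

    def load_obfuscator(self, fn: Callable[[str], str]) -> "ToyPipeline":
        self._obfuscator = fn
        return self

    def obfuscate(self, text: Union[str, list[str]]) -> Union[str, list[str]]:
        # Guard: fail early with RuntimeError if nothing has been loaded.
        if self._obfuscator is None:
            raise RuntimeError("call load_obfuscator() before obfuscate()")
        # A single string is processed directly; a list is processed item by item.
        if isinstance(text, str):
            return self._obfuscator(text)
        return [self._obfuscator(item) for item in text]

# Calling obfuscate() on a fresh instance triggers the guard.
try:
    ToyPipeline().obfuscate("hello")
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)

# After loading, both input shapes are accepted.
loaded = ToyPipeline().load_obfuscator(str.upper)
single = loaded.obfuscate("hello")
many = loaded.obfuscate(["a", "b"])
```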
- obfuscate_dataset(dataset: Dataset, column: str, column_obfuscated: str, batch_size: int = 10, num_proc: int | None = None)
Maps obfuscation sequentially across an entire HuggingFace (or compatible) dataset object.
- Parameters:
dataset (Dataset) – The dataset whose text column is to be obfuscated.
column (str) – Key of the column containing raw target text.
column_obfuscated (str) – Target base column key for saving the output.
batch_size (int, optional) – Number of examples processed together in each batch. Defaults to 10.
num_proc (Optional[int], optional) – Number of processes used for parallel processing. Defaults to None.
- Returns:
A new copy of the dataset with the obfuscation columns added.
- Return type:
Dataset
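The roles of column, column_obfuscated, and batch_size can be shown with a pure-Python stand-in for a batched map. This is a sketch only: `map_in_batches` is a hypothetical helper operating on a plain dict of lists, not the library's code. With the Hugging Face `datasets` library, the analogous operation is `Dataset.map(..., batched=True, batch_size=..., num_proc=...)`.

```python
from typing import Callable

# Column-oriented table: column name -> list of values (a simplified dataset).
Columns = dict[str, list]

def map_in_batches(
    columns: Columns,
    column: str,
    column_obfuscated: str,
    fn: Callable[[list[str]], list[str]],
    batch_size: int = 10,
) -> Columns:
    """Apply fn to slices of `column` and store the results in a new column."""
    texts = columns[column]
    output: list[str] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        output.extend(fn(batch))
    # Shallow copy, so the input table does not gain the new column.
    result = dict(columns)
    result[column_obfuscated] = output
    return result

seen_batch_sizes: list[int] = []

def reverse_batch(batch: list[str]) -> list[str]:
    # Toy transformation: reverse each string; record the batch size for inspection.
    seen_batch_sizes.append(len(batch))
    return [text[::-1] for text in batch]

table = {"text": ["ab", "cd", "ef", "gh", "ij"], "id": [0, 1, 2, 3, 4]}
mapped = map_in_batches(table, "text", "text_obf", reverse_batch, batch_size=2)
```

Five rows with a batch size of 2 are processed as batches of 2, 2, and 1, and the input table is left without the new column.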
- obfuscate_dataset_by_chunk(dataset_repo: str, column: str, column_obfuscated: str, save_chunks_to_folder: Path, dataset_config: str | None = None, dataset_split: str = 'train', chunk_size: int = 5000, batch_size: int = 100, num_proc: int | None = None, start_index: int = 0, num_samples: int | None = None) → Dataset
Streams a dataset from the Hub in chunks, obfuscates each chunk, and saves checkpoints to disk for fault tolerance.
- Parameters:
dataset_repo (str) – HuggingFace Hub repo ID or local path for load_dataset.
column (str) – Key of the column containing raw target text.
column_obfuscated (str) – Target base column key for saving the output.
save_chunks_to_folder (Path) – Directory to save/load disk checkpoints.
dataset_config (Optional[str]) – Dataset config/subset name passed to load_dataset.
dataset_split (str) – Split to stream (e.g. “train”, “validation”). Defaults to “train”.
chunk_size (int) – Number of examples per chunk. Defaults to 5_000.
batch_size (int) – Inner batch size passed to .map. Defaults to 100.
num_proc (Optional[int]) – CPU parallelism for .map. Defaults to None.
start_index (int) – Index of the first example to process. Defaults to 0.
num_samples (Optional[int]) – Number of examples to process from start_index. Defaults to None (process until stream is exhausted).
- Returns:
Concatenated dataset of all processed chunks.
- Return type:
Dataset
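The chunk-and-checkpoint idea can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the library's implementation: `process_in_chunks`, the JSON checkpoint format, and the `chunk_00000.json` naming scheme are hypothetical, and a plain iterator of strings stands in for the streamed dataset.

```python
import json
import tempfile
from itertools import islice
from pathlib import Path
from typing import Callable, Iterable, Optional

def process_in_chunks(
    stream: Iterable[str],
    fn: Callable[[str], str],
    folder: Path,
    chunk_size: int = 5000,
    start_index: int = 0,
    num_samples: Optional[int] = None,
) -> list[str]:
    """Process a stream chunk by chunk, reusing chunks already saved in `folder`."""
    folder.mkdir(parents=True, exist_ok=True)
    # Restrict the stream to the window [start_index, start_index + num_samples).
    stop = None if num_samples is None else start_index + num_samples
    rows = islice(stream, start_index, stop)
    results: list[str] = []
    chunk_id = 0
    while True:
        # Always pull the chunk so the iterator advances, even if it is checkpointed.
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            break
        path = folder / f"chunk_{chunk_id:05d}.json"
        if path.exists():
            # Checkpoint found: load the saved output instead of recomputing.
            processed = json.loads(path.read_text(encoding="utf-8"))
        else:
            processed = [fn(text) for text in chunk]
            path.write_text(json.dumps(processed), encoding="utf-8")
        results.extend(processed)
        chunk_id += 1
    return results

calls: list[str] = []

def shout(text: str) -> str:
    # Toy transformation that records each call, to show what is recomputed.
    calls.append(text)
    return text.upper()

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "chunks"
    texts = [f"t{i}" for i in range(10)]
    first = process_in_chunks(iter(texts), shout, folder, chunk_size=3, start_index=2, num_samples=5)
    calls_after_first = len(calls)
    # Second run with the same arguments reads the checkpoints from disk.
    second = process_in_chunks(iter(texts), shout, folder, chunk_size=3, start_index=2, num_samples=5)
    n_files = len(list(folder.glob("chunk_*.json")))
```

In this sketch the checkpoint file names depend only on the chunk position, so a resumed run has to use the same start_index and chunk_size as the run that wrote the files.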