Obfuscating Datasets

When working with large-scale NLP pipelines, obfuscating text line-by-line in a simple loop can become a significant performance bottleneck. To solve this, TMallet integrates directly with the Hugging Face ecosystem, allowing you to run optimized obfuscation passes over entire datasets.

text-mallet offers two primary methods for dataset-level operations: a standard in-memory batch mapping approach and a fault-tolerant chunked streaming approach for massive datasets.

Standard In-Memory Mapping

The obfuscate_dataset method executes an optimized batch operation over an active, in-memory Hugging Face Dataset object. Under the hood, it leverages the high-performance dataset.map() function to apply the loaded obfuscation algorithm concurrently across rows.

obfuscated_dataset = tmallet.obfuscate_dataset(
    dataset=my_dataset,
    column="text",
    column_obfuscated="obfuscated_text",
    batch_size=32,
    num_proc=4
)

Arguments

Argument

Type

Default

Description

dataset

Dataset

Required

The Hugging Face dataset collection containing your raw data.

column

str

Required

The name of the column containing the raw source text.

column_obfuscated

str

Required

The name of the new target column where obfuscated results will be stored.

batch_size

int

10

Number of examples forwarded to the obfuscator simultaneously in a single block.

num_proc

int

None

Number of CPU cores to spawn for parallel processing. If None, uses a single process.

Fault-Tolerant Chunked Streaming

For massive datasets that exceed RAM capacity or are hosted directly on the Hugging Face Hub, loading everything into memory at once is impractical. The obfuscate_dataset_by_chunk method resolves this by streaming the dataset incrementally.

Furthermore, it implements a checkpointing system. As it processes each chunk, it saves the intermediate state to disk. If your process gets interrupted due to network errors, hardware timeouts, or rate limits, re-running the script will automatically skip completed chunks and resume exactly where it left off.

from pathlib import Path

large_obfuscated_dataset = tmallet.obfuscate_dataset_by_chunk(
    dataset_repo="wikitext",
    column="text",
    column_obfuscated="masked_text",
    save_chunks_to_folder=Path("./checkpoints"),
    dataset_config="wikitext-103-raw-v1",
    dataset_split="train",
    chunk_size=5000
)

Arguments

Argument

Type

Default

Description

dataset_repo

str

Required

Hugging Face Hub repository ID (e.g., "imdb") or a local path directory.

column

str

Required

The name of the column containing the raw source text.

column_obfuscated

str

Required

The target column key where the obfuscated text is saved.

save_chunks_to_folder

Path

Required

Directory path on your local disk where chunk checkpoints will be saved/restored.

dataset_config

str

None

Sub-dataset configuration or subset name (passed directly to load_dataset).

dataset_split

str

"train"

The data split to target (e.g., "train", "validation", "test").

chunk_size

int

5000

The number of examples collected into a temporary block before executing obfuscation and saving a checkpoint.

batch_size

int

100

The inner batch size sent directly to .map() within each chunk pipeline.

num_proc

int

None

CPU core parallelism configuration split handled per chunk.

num_samples

int

None

An optional ceiling to cap total evaluated elements (useful for quick testing runs).

Note

When using chunked streaming, if a partial checkpoint folder is detected, the method prints a Loading checkpoint... status and automatically forwards the generator iterator over those records to guarantee that no data duplications or omissions occur.