Configurations
This page provides a comprehensive reference for the configuration models used across different text-mallet algorithms.
POS Filter Configuration ("pos-filter")
Configures token masking based on specific Part-of-Speech tags.
| Parameter | Type | Default | Description |
|---|---|---|---|
| filter_type | | | Action to take on the target tags (e.g., retain them or remove them). |
| pos_tags | | | A collection of Universal POS tags targeted by the filter. |
| replacement_mechanism | | | Strategy used to replace targeted text (e.g., replacing with the tag name). |
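As a rough illustration of how these three parameters could interact, the sketch below applies a POS filter to a sentence that is already tagged. It is not the library's implementation: the function `apply_pos_filter`, the option values `"remove"`, `"retain"`, and `"tag"`, and the `"_"` fallback placeholder are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical config: the keys come from the parameter table, but the
# values ("remove", "tag") are illustrative assumptions, not documented options.
pos_filter_config = {
    "filter_type": "remove",
    "pos_tags": ["NOUN", "PROPN"],
    "replacement_mechanism": "tag",
}


def apply_pos_filter(tagged: List[Tuple[str, str]], config: dict) -> str:
    """Mask tokens of a pre-tagged sentence according to the config."""
    targets = set(config["pos_tags"])
    out = []
    for token, tag in tagged:
        hit = tag in targets
        # "remove" masks the targeted tags; any other value masks the rest.
        mask = hit if config["filter_type"] == "remove" else not hit
        if mask:
            # Replace with the tag name, or with a fixed placeholder otherwise.
            out.append(tag if config["replacement_mechanism"] == "tag" else "_")
        else:
            out.append(token)
    return " ".join(out)


# Hand-tagged input standing in for the output of a real POS tagger.
tagged_sentence = [("Alice", "PROPN"), ("reads", "VERB"), ("books", "NOUN")]
print(apply_pos_filter(tagged_sentence, pos_filter_config))  # PROPN reads NOUN
```

Switching the hypothetical `filter_type` to `"retain"` in this sketch keeps the targeted tags and masks everything else, giving `Alice VERB books`.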
—
Shannon Filter Configuration ("shannon")
Configures text filtering using information-theoretic metrics calculated from an underlying language model context.
| Parameter | Type | Default | Description |
|---|---|---|---|
| threshold | | | The information surprisal ceiling or floor value used to trigger a mask. |
| bound | | | Determines if values above ( |
| replacement_mechanism | | | Strategy used to replace targeted text. |
| max_context_length | | | Max token historical window parsed by the LM for context evaluation. |
| output_mi_values | | | If true, outputs calculated mutual information/surprisal values along with the text. |
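The thresholding idea behind `threshold` and `bound` can be sketched with surprisal, defined as the negative base-2 logarithm of a token's probability. The example below is an assumption-laden illustration, not the library's code: the probabilities are hand-picked stand-ins for language-model output, and the `bound` values `"upper"` and `"lower"`, the `"_"` replacement, and the function names are hypothetical.

```python
import math
from typing import List, Tuple

# Hypothetical config: keys follow the parameter table; the values
# "upper" and "_" are assumptions made for this sketch.
shannon_config = {"threshold": 3.0, "bound": "upper", "replacement_mechanism": "_"}


def surprisal_bits(prob: float) -> float:
    """Surprisal of an event with probability `prob`, in bits."""
    return -math.log2(prob)


def apply_shannon_filter(token_probs: List[Tuple[str, float]], config: dict) -> str:
    """Mask tokens whose surprisal falls on the configured side of the threshold."""
    out = []
    for token, prob in token_probs:
        s = surprisal_bits(prob)
        if config["bound"] == "upper":
            mask = s > config["threshold"]  # threshold acts as a ceiling
        else:
            mask = s < config["threshold"]  # threshold acts as a floor
        out.append(config["replacement_mechanism"] if mask else token)
    return " ".join(out)


# Hand-picked probabilities standing in for real language-model predictions.
token_probs = [("the", 0.5), ("cat", 0.25), ("levitated", 0.01)]
print(apply_shannon_filter(token_probs, shannon_config))  # the cat _
```

Here the surprisals are 1 bit, 2 bits, and roughly 6.64 bits, so only the last token exceeds the assumed ceiling of 3.0.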
—
Linear Scramble Configuration ("scramble-BoW")
Configures a lightweight, framework-free Bag-of-Words shuffling operation.
| Parameter | Type | Default | Description |
|---|---|---|---|
| level | | | The text layer boundaries applied to shuffles. Allowed values are typically |
| seed | | | Pseudorandom generator state seed to guarantee deterministic text layouts. |
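To show why a seed makes the shuffle reproducible, here is a minimal Bag-of-Words scramble using only the Python standard library. The function `scramble_bow` is a hypothetical name, and the sketch shuffles a single unit of text; how the real `level` option splits text into units is not covered.

```python
import random
from typing import Optional


def scramble_bow(text: str, seed: Optional[int] = None) -> str:
    """Shuffle the whitespace-separated words of one unit of text."""
    words = text.split()
    # A dedicated Random instance keeps the result independent of global state.
    rng = random.Random(seed)
    rng.shuffle(words)
    return " ".join(words)


sentence = "the quick brown fox jumps"
first = scramble_bow(sentence, seed=7)
second = scramble_bow(sentence, seed=7)
print(first == second)  # True: the same seed reproduces the same layout
```

The output contains exactly the same words as the input, only in a different order, which is the defining property of a Bag-of-Words shuffle.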
—
Hierarchical Scramble Configuration ("scramble-hier")
Configures scrambling of syntactic structure, based on dependency trees generated by a spaCy backend.
| Parameter | Type | Default | Description |
|---|---|---|---|
| strength | | | Rearrangement severity settings. Typical options are |
| seed | | | Pseudorandom generator seed utilized when scrambling leaf node distributions. |
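The general idea of tree-based scrambling can be illustrated without spaCy by shuffling the children of each node in a hand-built tree. This is a toy sketch under stated assumptions: nested Python lists stand in for a dependency parse, `scramble_tree` is a hypothetical name, and the `strength` option is not modeled.

```python
import random
from typing import List, Union

# A node is either a word (leaf) or a list of child nodes.
Tree = Union[str, List["Tree"]]


def scramble_tree(node: Tree, rng: random.Random) -> List[str]:
    """Shuffle the children at every level, then flatten to a word list."""
    if isinstance(node, str):
        return [node]
    children = list(node)  # copy so the input tree is not mutated
    rng.shuffle(children)
    words: List[str] = []
    for child in children:
        words.extend(scramble_tree(child, rng))
    return words


# Hand-built stand-in for a parsed sentence; a real tree would come from a parser.
tree = [["the", "quick", "fox"], "jumps", ["over", ["the", "lazy", "dog"]]]
scrambled = scramble_tree(tree, random.Random(42))
print(" ".join(scrambled))
```

Because each subtree is shuffled in place, the words of one subtree stay contiguous in the output, which is what distinguishes this from a flat Bag-of-Words shuffle.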
Multi-Obfuscation Feature: For parameters that accept either a single value or a list (e.g., str | List[str]), passing a list of choices makes TMallet compute one obfuscation pass per choice at once. The result is then returned as a nested dictionary keyed by the options instead of a flat string. The one exception is the POS tags of the POS filter: that parameter must always be an array, so passing it does not by itself turn the result into a dictionary. To compute multiple obfuscation configurations at once, pass the options as an array within a single config; this improves performance because the text is processed only once where possible.
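The return-shape rule described above can be sketched as follows. This is not TMallet's API: `obfuscate` and `mask_long_words` are hypothetical helpers, and masking words longer than four characters is an arbitrary stand-in for a real obfuscation algorithm.

```python
from typing import Dict, List, Union


def mask_long_words(tokens: List[str], replacement: str) -> str:
    """Toy obfuscation: replace every word longer than four characters."""
    return " ".join(replacement if len(t) > 4 else t for t in tokens)


def obfuscate(
    text: str, replacement: Union[str, List[str]]
) -> Union[str, Dict[str, str]]:
    """Return a flat string for one option, or a dict keyed by option for a list."""
    # Shared preprocessing runs once, regardless of how many options are given.
    tokens = text.split()
    if isinstance(replacement, list):
        return {option: mask_long_words(tokens, option) for option in replacement}
    return mask_long_words(tokens, replacement)


print(obfuscate("secret plans at dawn", "_"))         # _ _ at dawn
print(obfuscate("secret plans at dawn", ["_", "#"]))  # {'_': '_ _ at dawn', '#': '# # at dawn'}
```

The single tokenization step before the branch mirrors the performance point: the text is prepared once, and only the cheap per-option step is repeated.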