New step: Add noise to number

Assignee

Reporter

Description

Implement a new step that lets Wrangler users anonymize data in integer or decimal columns by adding random noise to the values.

The step will support decimal and integer columns only. Other data types will be handled by separate steps.

Step parameters

  • Input column: a single column to anonymize. Must allow selection of integer and decimal columns.

  • Noise amount: determines how much noise to add to the original value. The amount can be specified in one of two ways:

    • As a constant: e.g., “20”, “500”. Positive and negative values must be allowed, but the sign is ignored by the algorithm (see below).

    • As a percentage: e.g., “10%”, “5%”. Positive and negative values must be allowed, but the sign is ignored by the algorithm.
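The two input forms above could be parsed roughly as follows. This is only a sketch: the class `NoiseAmount`, its fields, and the accepted format are assumptions for illustration, not the actual Wrangler implementation. It keeps the magnitude as a `BigDecimal` so that taking the absolute value of -2147483648 cannot overflow.

```java
import java.math.BigDecimal;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative and not part of Wrangler.
final class NoiseAmount {
    // Optional sign, digits, optional fraction, optional trailing '%'.
    // Spaces and thousands separators are deliberately not accepted (assumption).
    private static final Pattern FORMAT =
            Pattern.compile("([+-]?\\d+(?:\\.\\d+)?)(%?)");

    final BigDecimal magnitude; // always non-negative, the sign is ignored
    final boolean percentage;   // true when the input ended with '%'

    private NoiseAmount(BigDecimal magnitude, boolean percentage) {
        this.magnitude = magnitude;
        this.percentage = percentage;
    }

    static NoiseAmount parse(String text) {
        Matcher m = FORMAT.matcher(text);
        if (!m.matches()) {
            throw new IllegalArgumentException("Invalid noise amount: " + text);
        }
        BigDecimal value = new BigDecimal(m.group(1)).abs();
        return new NoiseAmount(value, !m.group(2).isEmpty());
    }
}
```

With this sketch, `NoiseAmount.parse("-10%")` yields a magnitude of 10 flagged as a percentage, while `"2,147,483,647"` is rejected.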

GUI properties

  • Step name: Add noise to number

  • Step description: Add random noise to a number

  • Long description: Anonymize (mask) data in integer and decimal columns by adding random noise to the original value.

  • Step label in step list: Add noise to ‘column’

  • Category: Anonymization

  • Toolbar placement: inside new top-level icon for the Anonymization category

  • Search keywords: gdpr anonymize pseudonymize fuzzing randomize sanitize mask

Algorithm

The step will generate random numbers with uniform distribution in the range defined by the “noise amount” parameter.

If the noise amount is specified as a constant, the new value will be generated in the range [$column - abs(noise_amount), $column + abs(noise_amount)) (closed on the lower bound, open on the upper bound).

If the noise amount is specified as a percentage, the new value will be generated in the range [$column - $column * abs(percent) / 100, $column + $column * abs(percent) / 100) (closed on the lower bound, open on the upper bound).
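The two ranges above can be sketched in Java as follows. This is a minimal illustration on `double` values, not the actual step or CTL implementation; the class and method names are hypothetical. One assumption goes beyond the text: for the percentage case the sketch takes the absolute value of the whole product, so the interval stays correctly ordered when the column value is negative.

```java
import java.util.Random;

// Hypothetical sketch; names are illustrative and not part of Wrangler.
final class NoiseRange {
    // Half-width of the interval for a constant noise amount (sign ignored).
    static double constantDelta(double noiseAmount) {
        return Math.abs(noiseAmount);
    }

    // Half-width for a percentage noise amount. Assumption: abs() is applied
    // to the whole product so that negative column values keep lower < upper.
    static double percentDelta(double value, double percent) {
        return Math.abs(value * percent / 100.0);
    }

    // Uniform sample from [value - delta, value + delta).
    static double addNoise(double value, double delta, Random rnd) {
        if (delta == 0.0) {
            return value; // empty half-open interval: leave the value unchanged (assumption)
        }
        double lower = value - delta;
        double upper = value + delta;
        double result = lower + rnd.nextDouble() * (2.0 * delta);
        // Floating-point rounding could land exactly on the open upper bound.
        return result < upper ? result : Math.nextDown(upper);
    }
}
```

For example, a value of 100 with a constant noise amount of -20 produces results in [80, 120), and a value of -200 with 10% noise produces results in [-220, -180).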

For additional information about the generated values see and the new CTL functions proposed there.

Notes

  • Random seed for the step will be derived in the following way:

    • When editing a job, the seed will be the same and will not “jump” when clicking on different steps in the job.

    • During runtime, the seed will be “random” - i.e., we will use the default initializer for the random number generator in Java.

    • In the future we may add global settings for random seed into Wrangler, but this will not be implemented at this point.

  • We must make sure the result does not overflow: values must be clamped to the min/max of the given data type.

  • We must generate decimal values with the proper number of digits after the decimal point. Internally, Wrangler always uses decimals with precision 32 and scale 10, but in many cases the column format will restrict that. We must use the format to determine the number of decimal places and then round (or truncate) the random numbers to that number of places.
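The notes above could translate into helpers along these lines. This is a hedged sketch only: `NoiseNotes`, `newRandom`, `clampToInt`, `toColumnScale`, and the `previewSeed` parameter are hypothetical names, and how the step actually learns whether a job is being edited, or reads the scale from the column format, is not covered here.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.Random;

// Hypothetical helpers; names are illustrative and not part of Wrangler.
final class NoiseNotes {
    // Fixed seed while editing a job so previews stay stable;
    // default-seeded generator at runtime.
    static Random newRandom(boolean editing, long previewSeed) {
        return editing ? new Random(previewSeed) : new Random();
    }

    // Clamp a noisy value computed in a wider type (long) into the int range.
    static int clampToInt(long noisy) {
        return (int) Math.max(Integer.MIN_VALUE, Math.min(Integer.MAX_VALUE, noisy));
    }

    // Reduce a noisy decimal to the scale taken from the column format.
    // HALF_UP rounds; RoundingMode.DOWN would truncate instead.
    static BigDecimal toColumnScale(BigDecimal noisy, int scale) {
        return noisy.setScale(scale, RoundingMode.HALF_UP);
    }
}
```

Computing the integer result in `long` before clamping avoids silent wrap-around when the noise amount is close to the `int` limits. Rounding may move a value onto the open upper bound of the interval, so truncation may be the safer of the two options mentioned in the note.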

Steps to reproduce

None

Activity

Petr POHL November 30, 2023 at 3:46 PM

Closing.

Petr POHL November 28, 2023 at 2:23 PM
Edited

Tested CloverDX Server 6.3.0.2

Integer

✓ Noise amount: 0
✓ Noise amount: 2147483647
✓ Noise amount: -2147483648
✓ Noise amount: 10%
✓ Noise amount: 0%
✓ Noise amount: 2147483647%
✓ Noise amount: -2147483648%

Noise amount: 2147483648 / 2147483648%
Error: Literal '…' is out of range for type 'int'

Noise amount: -2147483649 / -2147483649%
Error: Literal '…' is out of range for type 'int'

Noise amount: -2147483649 % / 2 147 483 647
Error: Invalid value '...' - character ' ' is not allowed

Noise amount: 2,147,483,647
Error: Invalid value '...' - character ',' is not allowed

Decimal

✓ Noise amount: 0
✓ Noise amount: 3.141592653589793238462643383279502884197
✓ Noise amount: 2147483647
✓ Noise amount: -2147483648
✓ Noise amount: 9223372036854775807.14
✓ Noise amount: -9223372036854775808.23
✓ Noise amount: 10%
✓ Noise amount: 0%
✓ Noise amount: 2147483647%
✓ Noise amount: -2147483648%

Noise amount: 2147483648 / 2147483648% / -2147483649 / -2147483649%
Error: Literal '...' is out of range for type 'int'

Noise amount: 12345678901234567890123456789012.1234567890
RUNTIME_ERROR: Number is out of available precision [32,10], value: 7817459280670780923871522881531.0900000000000000, Columns: decimal=95.09

Noise amount: 9223372036854775808,0 / -9,223,372,036,854,775,808.19
Error: Invalid value '...' - character ',' is not allowed

Noise amount: -9 223 372 036 854 775 808,19 / -9 223 372 036 854 775 808.19
Error: Invalid value '...' - character ' ' is not allowed

✓ converted integer from decimal
✓ converted decimal from integer
✓ converted decimal from string
✓ converted integer from string

Jiri Trnka November 27, 2023 at 2:02 PM

New step added.

Upper bound is fixed to match our data generator implementation.

Fixed

Details

Priority

Fix versions

QA Testing

UNDECIDED

Components

Created October 22, 2023 at 11:24 PM
Updated January 23, 2024 at 3:56 PM
Resolved November 27, 2023 at 2:02 PM