New step: Add noise to number
Description
Activity
Petr POHL November 30, 2023 at 3:46 PM
Closing.
Petr POHL November 28, 2023 at 2:23 PM (edited)
Tested CloverDX Server 6.3.0.2
Integer
✓ Noise amount: 0
✓ Noise amount: 2147483647
✓ Noise amount: -2147483648
✓ Noise amount: 10%
✓ Noise amount: 0%
✓ Noise amount: 2147483647%
✓ Noise amount: -2147483648%
Noise amount: 2147483648 / 2147483648% → Error: Literal '…' is out of range for type 'int'
Noise amount: -2147483649 / -2147483649% → Error: Literal '…' is out of range for type 'int'
Noise amount: -2147483649 % / 2 147 483 647 → Error: Invalid value '...' - character ' ' is not allowed
Noise amount: 2,147,483,647 → Error: Invalid value '...' - character ',' is not allowed
Decimal
✓ Noise amount: 0
✓ Noise amount: 3.141592653589793238462643383279502884197
✓ Noise amount: 2147483647
✓ Noise amount: -2147483648
✓ Noise amount: 9223372036854775807.14
✓ Noise amount: -9223372036854775808.23
✓ Noise amount: 10%
✓ Noise amount: 0%
✓ Noise amount: 2147483647%
✓ Noise amount: -2147483648%
Noise amount: 2147483648 / 2147483648% / -2147483649 / -2147483649% → Error: Literal '...' is out of range for type 'int'
Noise amount: 12345678901234567890123456789012.1234567890 → RUNTIME_ERROR: Number is out of available precision [32,10], value: 7817459280670780923871522881531.0900000000000000, Columns: decimal=95.09
Noise amount: 9223372036854775808,0 / -9,223,372,036,854,775,808.19 → Error: Invalid value '...' - character ',' is not allowed
Noise amount: -9 223 372 036 854 775 808,19 / -9 223 372 036 854 775 808.19 → Error: Invalid value '...' - character ' ' is not allowed
✓ converted integer from decimal
✓ converted decimal from integer
✓ converted decimal from string
✓ converted integer from string
Jiri Trnka November 27, 2023 at 2:02 PM
New step added.
Upper bound is fixed to match our data generator implementation.
Implement a new step that will allow Wrangler users to anonymize their data in integer or decimal columns by adding random noise to the values in their data.
The step will support decimal and integer columns only. Other data types will be handled by separate steps.
Step parameters
Input column: a single column to anonymize. Must allow selection of integer and decimal columns.
Noise amount: determines how much noise to add to the original value. The amount will be specified in two different ways:
As a constant: e.g., “20”, “500”. Positive and negative values must be allowed, but the sign is ignored by the algorithm (see below).
As a percentage: e.g., “10%”, “5%” etc. Positive and negative values must be allowed but the sign will be ignored.
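The two input forms can be illustrated with a small Python sketch. It is not the Wrangler implementation: the function name `parse_noise_amount`, the regular expression, and the error texts are hypothetical, and the validation rules only mirror what the test comments on this ticket observed (no spaces or thousands separators, and literals without a decimal point must fit the `int` range).

```python
import re
from decimal import Decimal

# Assumed 32-bit integer limits, matching the boundary values tested on this ticket.
INT_MIN, INT_MAX = -2**31, 2**31 - 1

# Optional sign, digits, optional fraction, optional trailing '%'.
# No spaces and no ',' separators are accepted.
_PATTERN = re.compile(r"(-?\d+(?:\.\d+)?)(%?)")


def parse_noise_amount(text):
    """Return (magnitude, is_percent) for a noise amount such as '20' or '-10%'."""
    match = _PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"Invalid value '{text}'")
    number, percent_sign = match.groups()
    # Literals without a decimal point are treated as integers and range-checked.
    if "." not in number and not INT_MIN <= int(number) <= INT_MAX:
        raise ValueError(f"Literal '{number}' is out of range for type 'int'")
    # The sign is accepted on input but ignored by the algorithm.
    return abs(Decimal(number)), percent_sign == "%"
```

For example, `parse_noise_amount("-20")` and `parse_noise_amount("20")` return the same magnitude, which is how the "sign is ignored" rule shows up in this sketch.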
GUI properties
Step name: Add noise to number
Step description: Add random noise to a number
Long description: Anonymize (mask data) in integer and decimal columns by adding random noise to the original value.
Step label in step list: Add noise to ‘column’
Category: Anonymization
Toolbar placement: inside new top-level icon for the Anonymization category
Search keywords: gdpr anonymize pseudonymize fuzzing randomize sanitize mask
Algorithm
The step will generate random numbers with uniform distribution in the range defined by the “noise amount” parameter.
If noise amount is specified as a constant, the new value will be generated in the range
[$column - abs(noise_amount), $column + abs(noise_amount))
(the interval is closed on the lower bound, open on the upper bound).
If noise amount is specified as a percentage, the new value will be generated in the range
[$column - $column * abs(percent) / 100, $column + $column * abs(percent) / 100)
(closed on lower, open on upper bound).
For additional information about the generated values see and the new CTL functions proposed there.
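The two ranges above can be sketched in Python as follows. This is an illustration under stated assumptions, not the step's actual code: the names `noise_bounds` and `add_noise` are hypothetical, values are plain floats, and for the percentage case the sketch uses `abs(value)` so that the lower bound stays below the upper bound for negative column values (the formula above does not say how negative values are handled).

```python
import random


def noise_bounds(value, amount, is_percent):
    """Return (low, high) of the half-open interval [low, high) for the new value."""
    delta = abs(amount)  # the sign of the noise amount is ignored
    if is_percent:
        # Assumption: use abs(value) so that low <= high for negative values.
        delta = abs(value) * delta / 100
    return value - delta, value + delta


def add_noise(value, amount, is_percent, rng):
    """Draw a uniformly distributed value from the interval around `value`."""
    low, high = noise_bounds(value, amount, is_percent)
    if low == high:
        return value  # zero noise: the value is unchanged
    # rng.random() is in [0.0, 1.0), so the result is intended to stay below
    # `high`; float rounding could in rare cases land exactly on it.
    return low + (high - low) * rng.random()


rng = random.Random(42)
noisy_constant = add_noise(100.0, -20, False, rng)  # somewhere in [80, 120)
noisy_percent = add_noise(200.0, 10, True, rng)     # somewhere in [180, 220)
```

Integer columns, clamping, and rounding to the column's decimal scale are left out of this sketch.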
Notes
Random seed for the step will be derived in the following way:
When editing a job, the seed will be the same and will not “jump” when clicking on different steps in the job.
During runtime, the seed will be “random” - i.e., we will use the default initializer for the random number generator in Java.
In the future we may add global settings for random seed into Wrangler, but this will not be implemented at this point.
We must make sure not to overflow and clamp values to min/max for given data type.
We must generate decimal values with the proper number of digits after the decimal point. Internally, Wrangler always uses 32.10 decimals (precision 32, scale 10), but in many cases the data will have a format that restricts that. We must use the format to determine the number of digits after the decimal point and then round (or truncate) the random numbers to that number of decimal places.
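The notes above can be sketched in Python as three small helpers. All names here (`make_rng`, `clamp_int`, `round_to_scale`) are hypothetical, and several details are assumptions: the integer type is taken to be 32-bit, the scale is assumed to be already extracted from the column format as an integer, and half-up rounding is chosen where the note allows either rounding or truncation.

```python
import random
from decimal import Context, Decimal, ROUND_HALF_UP

# Assumed 32-bit integer limits and the 32-digit precision mentioned above.
INT_MIN, INT_MAX = -2**31, 2**31 - 1
DECIMAL_CONTEXT = Context(prec=32)


def make_rng(editing, edit_seed=0):
    """Fixed seed while editing a job; default (system) seeding at runtime."""
    return random.Random(edit_seed) if editing else random.Random()


def clamp_int(value):
    """Clamp a noisy integer to the data type's range instead of overflowing."""
    return max(INT_MIN, min(INT_MAX, value))


def round_to_scale(value, scale):
    """Round a Decimal to `scale` digits after the decimal point."""
    quantum = Decimal(1).scaleb(-scale)  # e.g. scale=2 gives Decimal('0.01')
    return value.quantize(quantum, rounding=ROUND_HALF_UP, context=DECIMAL_CONTEXT)
```

Swapping `ROUND_HALF_UP` for `ROUND_DOWN` would give the truncating variant.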