Banks and financial institutions are bound by customer agreements and various regulations, such as HIPAA, SOX and the Gramm-Leach-Bliley Act (GLBA), to protect customer data. Information sharing is restricted by the governing laws applicable in the country where the data originates. At the same time, data needs to be shared with partners and employees to complete various transactions.
Data masking solutions transform data so that it can be shared in a secure and foolproof manner. Data can be masked from numerous data sources – ASCII files, VSAM files (EBCDIC encoded) and databases.
The masked data is stored in a target repository of a similar type – an ASCII file, a VSAM file or a database. Masked data retains the statistical properties, integrity and realism of the original data, allowing for effective and efficient testing, development and research while eliminating the risk of disclosing sensitive data.
Data Masking Tool Functionality
- Easy-to-use configuration tool
- Build and edit relationships
- Maintains referential integrity in the target repository
- Fault-tolerant workflow – supports restart of a previously suspended batch
- Supports horizontal partitioning of data through data filters
- Extensible and resilient application architecture
- Low memory footprint
- Scripting facility to design custom rules by combining various other rules
- Supports the EBCDIC character set
- Enables controlled data sharing
- Facilitates regulatory compliance – GLBA/HIPAA/SOX/PIPEDA/Data Protection Act (DPA)/provincial privacy laws
- Alleviates security concerns when moving production data offshore
- Masks data without compromising data associations or referential integrity
- Maintains transparency in its internal repository – configuration data is stored on the file system
Thorough Masking on an Outsourced Database
A database going offsite to an outsourced development team might have extremely thorough masking applied, so that only the real information absolutely necessary for the remote personnel to perform their function remains.
Data Masking Techniques
Substitution
This technique consists of randomly replacing the contents of a column of data with information that looks similar but is completely unrelated to the real details. For example, the surnames in a customer database could be sanitized by replacing the real last names with surnames drawn from a largish random list.
Substitution is very effective in terms of preserving the look and feel of the existing data. The downside is that a largish store of substitutable information must be available for each column to be substituted. For example, to sanitize surnames by substitution, a list of random last names must be available. Then to sanitize telephone numbers, a list of phone numbers must be available. Frequently, the ability to generate known invalid data (credit card numbers that will pass the checksum tests but never work) is a nice-to-have feature.
Substitution data can sometimes be very hard to find in large quantities; however, any data masking software should contain datasets of commonly required items. When evaluating data masking software, the size, scope and variety of the supplied datasets should be considered. Another useful feature to look for is the ability to build your own custom datasets and add them for use in the masking rules.
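As an illustration, the sketch below (assuming Python with pandas, and a small stand-in surname dataset) shows the basic substitution idea: every value in a column is replaced by a randomly chosen entry from a replacement dataset.

```python
# A minimal sketch of the Substitution technique, assuming a pandas DataFrame
# with a LAST_NAME column and an illustrative list of replacement surnames.
import random
import pandas as pd

SURNAME_DATASET = ["Wilson", "Taylor", "Brown", "Clark", "Walker"]  # stand-in dataset

def substitute_column(df: pd.DataFrame, column: str, dataset: list, seed: int = 42) -> pd.DataFrame:
    """Replace every value in `column` with a randomly chosen entry from `dataset`."""
    rng = random.Random(seed)
    masked = df.copy()
    masked[column] = [rng.choice(dataset) for _ in range(len(masked))]
    return masked

customers = pd.DataFrame({"FIRST_NAME": ["Robert", "Jane"], "LAST_NAME": ["Smith", "Jones"]})
print(substitute_column(customers, "LAST_NAME", SURNAME_DATASET))
```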
Shuffling
Shuffling is similar to substitution except that the substitution data is derived from the column itself. Essentially the data in a column is randomly moved between rows until there is no longer any reasonable correlation with the remaining information in the row.
There is a certain danger in the shuffling technique. It does not prevent people from asking questions like “I wonder if so-and-so is on the supplier list?” In other words, the original data is still present and sometimes meaningful questions can still be asked of it. Another consideration is the algorithm used to shuffle the data. If the shuffling method can be determined, then the data can easily be “un-shuffled”. For example, if the shuffle algorithm simply ran down the table swapping the column data between every pair of rows, it would not take much work for an interested party to revert things to their un-shuffled state.
Shuffling is rarely effective when used on small amounts of data. For example, if there are only 5 rows in a table it probably will not be too difficult to figure out which of the shuffled data really belongs to which row. On the other hand, if a column of numeric data is shuffled, the sum and average of the column still work out to the same amount. This can sometimes be useful. Shuffle rules are best used on large tables and leave the look and feel of the data intact. They are fast, but great care must be taken to use a sophisticated algorithm to randomise the shuffling of the rows.
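A minimal sketch of the shuffling idea, assuming Python with pandas and NumPy; a production implementation would need a more careful randomisation strategy than this simple permutation:

```python
# A minimal sketch of the Shuffling technique: the substitution values come from
# the column itself, randomly permuted across rows (illustrative pandas example).
import numpy as np
import pandas as pd

def shuffle_column(df: pd.DataFrame, column: str, seed: int = 7) -> pd.DataFrame:
    """Randomly permute the values of `column` across rows."""
    rng = np.random.default_rng(seed)
    shuffled = df.copy()
    shuffled[column] = rng.permutation(shuffled[column].to_numpy())
    return shuffled

suppliers = pd.DataFrame({"SUPPLIER": ["Acme", "Globex", "Initech", "Umbrella"],
                          "AMOUNT": [100, 250, 75, 310]})
print(shuffle_column(suppliers, "SUPPLIER"))
# Note: the total and average of any shuffled numeric column are unchanged.
```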
Number and Date Variance
The Number Variance technique is useful for numeric or date data. Simply put, the algorithm involves modifying each number or date value in a column by some random percentage of its real value. This technique has the nice advantage of providing a reasonable disguise for the data while still keeping the range and distribution of values in the column within existing limits. For example, a column of salary details might have a random variance of ±10% applied to it. Some values would be higher, some lower, but all would remain close to their original range. Date fields are also good candidates for variance techniques. Birth dates, for example, could be varied within an arbitrary range of ±120 days, which effectively disguises the personally identifiable information while still preserving the distribution. The variance technique can prevent attempts to discover true records using known date data, or the exposure of sensitive numeric or date data.
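The sketch below (assuming Python with pandas and NumPy, and illustrative column names) applies a ±10% variance to a salary column and a ±120 day variance to a date column:

```python
# A minimal sketch of the Number and Date Variance technique (illustrative names).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def vary_numeric(series: pd.Series, pct: float = 0.10) -> pd.Series:
    """Apply a random variance of +/- pct to each numeric value."""
    factors = rng.uniform(1.0 - pct, 1.0 + pct, size=len(series))
    return (series * factors).round(2)

def vary_dates(series: pd.Series, max_days: int = 120) -> pd.Series:
    """Shift each date by a random whole number of days in the range +/- max_days."""
    offsets = pd.to_timedelta(rng.integers(-max_days, max_days + 1, size=len(series)), unit="D")
    return pd.Series(series.to_numpy() + offsets.to_numpy(), index=series.index)

staff = pd.DataFrame({"SALARY": [52000.0, 61000.0, 48500.0],
                      "BIRTH_DATE": pd.to_datetime(["1980-03-14", "1975-11-02", "1990-06-30"])})
staff["SALARY"] = vary_numeric(staff["SALARY"])
staff["BIRTH_DATE"] = vary_dates(staff["BIRTH_DATE"])
print(staff)
```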
Encryption
This technique offers the option of leaving the data in place and visible to those with the appropriate key while remaining effectively useless to anybody without the key. This would seem to be a very good option – yet, for anonymous test databases, it is one of the least useful techniques.
The advantage of having the real data available to anybody with the key – is actually a major disadvantage in a test or development database. The “optional” visibility provides no major advantage in a test system and the encryption password only needs to escape once and all of the data is compromised. Of course, you can change the key and regenerate the test instances – but outsourced, stored or saved copies of the data are all still available under the old password. Encryption also destroys the formatting and look and feel of the data. Encrypted data rarely looks meaningful, in fact, it usually looks like binary data. This sometimes leads to character set issues when manipulating encrypted varchar fields. Certain types of encryption impose constraints on the data format as well. In effect, this means that the fields must be extended with a suitable padding character which must then be stripped off at decryption time.
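For illustration, the following sketch (assuming the third-party cryptography package) shows how an encrypted value loses the formatting and look and feel of the original data while remaining recoverable by anyone who holds the key:

```python
# A minimal sketch, assuming the third-party `cryptography` package is installed,
# illustrating why encrypted values lose the look and feel of the original data.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # anyone holding this key can recover the data
cipher = Fernet(key)

original = "Robert Smith"
token = cipher.encrypt(original.encode("utf-8"))
print(token)                                   # opaque, binary-looking token, not a name
print(cipher.decrypt(token).decode("utf-8"))   # "Robert Smith" again, for key holders
```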
The strength of the encryption is also an issue. Some encryption is more secure than others. According to the experts, most encryption systems can be broken – it is just a matter of time and effort. In other words, not very much will keep the national security agencies of largish countries from reading your files should they choose to do so. This may not be a big worry if the requirement is to protect proprietary business information. Never, ever, use a simplistic encryption scheme designed by amateurs.
For example, one in which the letter ‘A’ is replaced by ‘X’ and the letter ‘B’ by ‘M’, etc., is trivially easy to decrypt based on letter frequency probabilities.
Nulling Out/Truncating
Simply deleting a column of data by replacing it with NULL values is an effective way of ensuring that it is not inappropriately visible in test environments. Unfortunately it is also one of the least desirable options from a test database standpoint. Usually the test teams need to work on the data or at least a realistic approximation of it. For example, it is very hard to write and test customer account maintenance forms if the customer name, address and contact details are all NULL values. NULL’ing or truncating data is useful in circumstances where the data is simply not required, but is rarely useful as the entire data sanitization strategy.
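Nulling out is trivially easy to implement; the sketch below (assuming Python with pandas and illustrative column names) simply replaces a column with NULL values:

```python
# A minimal sketch of Nulling Out, shown on an illustrative pandas DataFrame.
import pandas as pd

customers = pd.DataFrame({"CUSTOMER_NAME": ["Robert Smith", "Jane Jones"],
                          "BALANCE": [120.00, 75.50]})
customers["CUSTOMER_NAME"] = None  # every value in the column becomes NULL
print(customers)
```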
Masking Out Data
Masking data, besides being the generic term for the process of data anonymization, means replacing certain fields with a mask character (such as an X). This effectively disguises the data content while preserving the same formatting on front-end screens and reports. For example, a column of credit card numbers might look like:
4346 6454 0020 5379
4493 9238 7315 5787
4297 8296 7496 8724
and after the masking operation the information would appear as:
4346 XXXX XXXX 5379
4493 XXXX XXXX 5787
4297 XXXX XXXX 8724
The masking characters effectively remove much of the sensitive content from the record while still preserving the look and feel. Take care to ensure that enough of the data is masked to preserve security. It would not be hard to regenerate the original credit card number from a masking operation such as:
4297 8296 7496 87XX
since the numbers are generated with a specific and well known checksum algorithm. Also care must be taken not to mask out potentially required information. A masking operation such as
XXXX XXXX XXXX 5379
would strip the card issuer details from the credit card number. This may, or may not, be desirable. If the data is in a specific, invariable format, then Masking Out is a powerful and fast option. If numerous special cases must be dealt with then masking can be slow, extremely complex to administer and can potentially leave some data items inappropriately masked.
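A minimal sketch of a masking-out routine in Python (illustrative function name), which keeps the leading and trailing four digits of a card number and masks the rest while preserving the formatting:

```python
# A minimal sketch of Masking Out: keep the first four and last four digits of a
# card number and replace the rest with 'X', preserving the original formatting.
def mask_card_number(card: str, mask_char: str = "X") -> str:
    total_digits = sum(ch.isdigit() for ch in card)
    digits_seen = 0
    out = []
    for ch in card:
        if ch.isdigit():
            digits_seen += 1
            # keep the leading 4 (issuer) and trailing 4 digits, mask the middle
            out.append(ch if digits_seen <= 4 or digits_seen > total_digits - 4 else mask_char)
        else:
            out.append(ch)  # preserve spaces and other formatting characters
    return "".join(out)

print(mask_card_number("4346 6454 0020 5379"))  # -> 4346 XXXX XXXX 5379
```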
Row-Internal Synchronization
Data is rarely available in a fully normalized format. Consider, for example, a table in which a FULL_NAME field is composed of data from other columns in the same row. After the data has been masked, the FIRST_NAME and LAST_NAME columns will have been changed to other values. For the information to be secure, the FULL_NAME field must clearly also change; moreover, it must change to contain values synchronized with the rest of the data in the row so that the masked data reflects the denormalized structure of the row. This type of synchronization is called Row-Internal Synchronization, and it is quite distinct from the other two types: Table-Internal and Table-To-Table Synchronization.
The Row-Internal Synchronization technique updates a field in a row with a combination of values from the same row. This means that if, after masking, the FIRST_NAME and LAST_NAME change to Albert and Wilson then (in this example) the FULL_NAME column should be updated to contain Albert Wilson. Row-Internal Synchronization is a common requirement and the data scrambling software you choose should support it.
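A minimal sketch of Row-Internal Synchronization, assuming Python with pandas and the FIRST_NAME/LAST_NAME/FULL_NAME columns described above:

```python
# A minimal sketch of Row-Internal Synchronization: after FIRST_NAME and LAST_NAME
# have been masked, rebuild the denormalized FULL_NAME from the same row.
import pandas as pd

people = pd.DataFrame({"FIRST_NAME": ["Albert", "Mary"],
                       "LAST_NAME": ["Wilson", "Clark"],
                       "FULL_NAME": ["Robert Smith", "Jane Jones"]})  # stale, pre-mask values

people["FULL_NAME"] = people["FIRST_NAME"] + " " + people["LAST_NAME"]
print(people)  # FULL_NAME now reads "Albert Wilson", "Mary Clark"
```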
Table-Internal Synchronization
Sometimes the same data appears in multiple rows within the same table. Consider, for example, a table in which the name Robert Smith appears in the FIRST_NAME and LAST_NAME columns of several rows; some of the data items are denormalized because of these repetitions across rows. If the name Robert Smith changes to Albert Wilson after masking, then every other row that references Robert Smith must also change to Albert Wilson in a consistent manner. This requirement is necessary to preserve the relationships between the data rows and is called Table-Internal Synchronization.
A Table-Internal Synchronization operation will update columns in groups of rows within a table to contain identical values. This means that every occurrence of Robert Smith in the table will contain Albert Wilson. Good data anonymization software should provide support for this requirement.
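The sketch below (assuming Python with pandas and an illustrative replacement mapping) shows the essence of Table-Internal Synchronization: every occurrence of the same original value receives the same masked value:

```python
# A minimal sketch of Table-Internal Synchronization: every occurrence of a given
# (FIRST_NAME, LAST_NAME) pair is mapped to the same masked replacement.
import pandas as pd

table = pd.DataFrame({"FIRST_NAME": ["Robert", "Jane", "Robert"],
                      "LAST_NAME": ["Smith", "Jones", "Smith"]})

replacements = {("Robert", "Smith"): ("Albert", "Wilson"),
                ("Jane", "Jones"): ("Mary", "Clark")}  # illustrative mapping

masked_pairs = [replacements[pair] for pair in zip(table["FIRST_NAME"], table["LAST_NAME"])]
table[["FIRST_NAME", "LAST_NAME"]] = masked_pairs
print(table)  # both "Robert Smith" rows now read "Albert Wilson"
```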
Table-To-Table Synchronization
It is often the case in practical information systems that identical information is distributed among multiple tables in a denormalized format. For example, an employee name may be held in several tables. It is desirable (often essential) that if the name is masked in one column, then the other tables in which the information is held should also be updated with an identical value. This requirement is called Table-To-Table Synchronization and is the type of synchronization most people think of when considering a data anonymization process. It should be noted that there are also two other types of synchronization, Row-Internal and Table-Internal (see above), and all three have quite different purposes and functions.
Table-To-Table Synchronization operations are designed to correlate any data changes made to a table column with columns in other tables. Performing a Table-To-Table Synchronization operation therefore requires knowledge of a join condition between the two tables; the join condition is used to ensure the appropriate rows are updated correctly. Also required is knowledge of the columns in each of the source and target tables which must be synchronized.
As an example, consider the two tables:
TABLE SOURCE
COLUMN IDNUM number(10)
COLUMN NAME varchar(40)
TABLE TARGET
COLUMN IDNUM number(10)
COLUMN NAME varchar(40)
Assume the NAME column must be synchronized between the two tables and the join condition is
SOURCE.IDNUM=TARGET.IDNUM
If a Substitution operation is performed on the SOURCE.NAME column and a subsequent Table-To-Table Synchronization operation is applied, then each TARGET.NAME column will receive values identical to the ones in the SOURCE table where the join condition matches.
It is important to realize that if there are IDNUM values in the TARGET table that are not present in the SOURCE table, the Table-To-Table Synchronization rule will have no effect on those rows. Table-To-Table Synchronization is probably the most common synchronization task – in fact, it is rare that a data scrambling process does not require it. Every data masking solution should provide support for this operation.
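Using the SOURCE and TARGET tables above, the following sketch (assuming Python with pandas) propagates the masked SOURCE.NAME values into TARGET.NAME where the IDNUM join condition matches, leaving unmatched TARGET rows untouched:

```python
# A minimal sketch of Table-To-Table Synchronization using the SOURCE/TARGET
# example above, with IDNUM as the join condition (illustrative pandas frames).
import pandas as pd

source = pd.DataFrame({"IDNUM": [1, 2, 3],
                       "NAME": ["Albert Wilson", "Mary Clark", "John Reed"]})  # already masked
target = pd.DataFrame({"IDNUM": [2, 3, 4],
                       "NAME": ["Jane Jones", "Robert Smith", "Ann Lee"]})

# propagate masked SOURCE.NAME into TARGET.NAME where SOURCE.IDNUM = TARGET.IDNUM
lookup = source.set_index("IDNUM")["NAME"]
matched = target["IDNUM"].map(lookup)
target["NAME"] = matched.fillna(target["NAME"])  # IDNUM 4 has no match and is left untouched
print(target)
```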
Table-To-Table Synchronization On Primary Key
There are some special cases of Table-To-Table Synchronization which deserve their own discussion. The first is a scenario in which the masked data must be synchronized between tables and in which the data being masked also forms the join condition between the tables.
Table-To-Table Synchronization Via Third Table
The second special case in Table-To-Table Synchronization operations is a scenario in which the two tables have no direct join relationship. In other words, the data columns must be synchronized between the two tables but the equivalent rows must be identified by a join made through a third table.
Synchronizing Between Different Datatypes
In many cases, a set of data will contain the same data entity in multiple locations, but this data will be stored in different formats. For example, a CUSTOMER_ID field might have a datatype of VARCHAR in one table and INTEGER in another. If this is the case, problems can arise from two sources: if the values from one column are substituted into the other as part of a synchronization operation, or if the columns are used as join keys in order to synchronize other data.
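The sketch below (assuming Python with pandas and illustrative table contents) shows the datatype issue: a VARCHAR-style CUSTOMER_ID must be cast before it can be used as a join key against an INTEGER-style CUSTOMER_ID:

```python
# A minimal sketch of the datatype mismatch issue: CUSTOMER_ID is stored as a
# string in one table and as an integer in another, so the column must be cast
# to a common type before it can be used as a join key for synchronization.
import pandas as pd

orders = pd.DataFrame({"CUSTOMER_ID": ["001", "002"], "TOTAL": [19.99, 5.00]})          # VARCHAR-style
profiles = pd.DataFrame({"CUSTOMER_ID": [1, 2], "NAME": ["Albert Wilson", "Mary Clark"]})  # INTEGER-style

orders["CUSTOMER_ID"] = orders["CUSTOMER_ID"].astype(int)  # normalize the key type
joined = orders.merge(profiles, on="CUSTOMER_ID")
print(joined)
```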
Cross Database Synchronization
Similar to the requirement for Cross Schema Synchronization is the requirement for Cross Database Synchronization. In this particular scenario the databases are co-located in the same server but the masked data is located in separate databases (and also schemas). If this requirement exists, the analysis phase should carefully plan for it and the database software should be able to support such a synchronization operation.
Cross Server Synchronization
As with Cross Schema and Cross Database Synchronization, it is sometimes necessary to synchronize data between platforms. This type of synchronization necessarily involves the physical movement of the data between distinct environments and hence presents a certain technical challenge if the data volumes are large. Fundamentally there are two approaches to the movement of the data: one can use the inter-database transport mechanisms supplied natively by the database vendor, or the data movement can be handled and copied by the data masking software. Typically the inter-database data transportation mechanisms are highly optimised by the database vendor and often have advanced functionality, such as the ability to create and replicate an entire table with a single command.
Cross Platform Server Synchronization
The scenario where data must be kept consistent between tables which are located in physically different servers, in databases from different vendors, is probably the hardest type of synchronization operation to perform. Many vendors supply data transportation compatibility routines which can move the data to and from servers from other vendors. If this type of synchronization is required, then the data masking software must have the ability to support the use of these routines, or have the ability to perform the cross-platform copy operation itself.
Selective Masking: Ability to Apply a WHERE Clause
It is essential to be able to use specific criteria to choose the rows on which the masking operations are performed. In effect, this means that it must be possible to apply a Where Clause to a set of data and have the masking operations apply only to that subset of the table. As an example of a Where Clause requirement, consider a masking operation on a column containing first names. These names are gender specific, and the end users of the database may well require male and female names to be present in the appropriate rows after the masking operations complete. Two rules, each with a Where Clause based on the gender column, will be required here. There is a potential trap here – note the discussion entitled Where Clause Skips in the Data Masking Issues section of this document.
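A minimal sketch of the two-rule, gender-specific approach described above, assuming Python with pandas and small illustrative name datasets:

```python
# A minimal sketch of selective masking with a Where Clause: rows are masked with
# different first-name datasets depending on the GENDER column (illustrative data).
import random
import pandas as pd

MALE_NAMES = ["Albert", "Brian", "Carl"]
FEMALE_NAMES = ["Alice", "Beth", "Carol"]

rng = random.Random(1)
customers = pd.DataFrame({"FIRST_NAME": ["Robert", "Jane", "John", "Mary"],
                          "GENDER": ["M", "F", "M", "F"]})

# Rule 1: WHERE GENDER = 'M'
male_rows = customers["GENDER"] == "M"
customers.loc[male_rows, "FIRST_NAME"] = [rng.choice(MALE_NAMES) for _ in range(male_rows.sum())]

# Rule 2: WHERE GENDER = 'F'
female_rows = customers["GENDER"] == "F"
customers.loc[female_rows, "FIRST_NAME"] = [rng.choice(FEMALE_NAMES) for _ in range(female_rows.sum())]

print(customers)
```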
Selective Masking: Ability to Apply Sampling
It is often useful to have the ability to apply masking operations to a sample of the data in a table. For example, a column of surnames might be masked using a large dataset of surnames. In reality, some names are much more common than others. In order to ensure that enough duplicate names are present and the masked data better represents reality, it may be desirable to apply a subsequent masking rule to 10 or 20 percent of the rows in the table using a small dataset of common last names.
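The sketch below (assuming Python with pandas and illustrative values) applies a second masking rule to roughly 15 percent of the rows using a small dataset of common surnames:

```python
# A minimal sketch of sampled masking: after an initial substitution, a second rule
# re-masks roughly 15% of rows with a small list of very common surnames so that
# realistic duplicates appear in the masked data (illustrative values).
import pandas as pd

COMMON_SURNAMES = ["Smith", "Jones", "Brown"]

masked = pd.DataFrame({"LAST_NAME": ["Quigley", "Abernathy", "Oakes", "Tran",
                                     "Velazquez", "Ngata", "Fairweather", "Okafor"]})

sample = masked.sample(frac=0.15, random_state=3)  # ~15% of rows
masked.loc[sample.index, "LAST_NAME"] = [COMMON_SURNAMES[i % len(COMMON_SURNAMES)]
                                         for i in range(len(sample))]
print(masked)
```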
Problems with These Approaches
- There is an inherent problem with all approaches that generate the perturbed value as a function of the original value, Y = f(X, e), where e is a noise term
- These include all noise addition approaches, data swapping, microaggregation, and any variation of these approaches
- Using Dalenius' definition of disclosure risk, all these techniques result in disclosure
- If we attempt to reduce disclosure risk, information loss increases (and vice versa)
- What we need is a method that ensures that the release of the masked data does not result in any additional disclosure, while providing masked data whose characteristics closely resemble the original data
- From a statistical perspective, at least theoretically, there is a relatively easy solution
Conditional Distribution Approach (CDA)
- Data set consisting of a set of non-confidential variables S and confidential variables X
- Identify the joint distribution f(S,X)
- Compute the conditional distribution f(X|S)
- Generate the masked values yi using f(X|S = si)
- When S is null, simply generate a new data set with the same characteristics as f(X)
- Then the joint distribution of (S and Y) is the same as that of (S and X)
- f(S,Y) = f(S,X)
- Little or no information loss since the joint distribution of the original and masked data are the same
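As an illustration only, the sketch below (Python with NumPy) applies the conditional distribution approach under the simplifying assumption that (S, X) is bivariate normal, so that f(X|S) has a closed form; with real data, identifying and sampling from f(X|S) is precisely the hard part discussed later in this section.

```python
# A minimal sketch of the Conditional Distribution Approach, assuming for
# illustration that (S, X) is bivariate normal so f(X | S) has a closed form.
import numpy as np

rng = np.random.default_rng(0)

# illustrative original data: non-confidential S, confidential X correlated with S
n = 1000
S = rng.normal(50, 10, size=n)
X = 2.0 * S + rng.normal(0, 5, size=n)

# estimate the joint (normal) distribution f(S, X)
mu_s, mu_x = S.mean(), X.mean()
var_s, var_x = S.var(), X.var()
cov_sx = np.cov(S, X)[0, 1]

# conditional distribution f(X | S = s): normal with these parameters
cond_mean = mu_x + (cov_sx / var_s) * (S - mu_s)
cond_var = var_x - cov_sx**2 / var_s

# generate masked values Y from f(X | S = s_i); (S, Y) then has the same joint
# distribution as (S, X), but Y is independent of X given S
Y = cond_mean + rng.normal(0, np.sqrt(cond_var), size=n)

print(np.corrcoef(S, X)[0, 1], np.corrcoef(S, Y)[0, 1])  # similar correlations
```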
Disclosure Risk of CDA
- When the masked data is generated using CDA, it can be verified that f(X|Y,S,A) = f(X|S,A)
- Releasing the masked microdata Y does not provide any new information to the intruder over and above the non-confidential variables S and A (the aggregate information regarding the joint distribution of S and X)
- The CDA approach results in little or no information loss, minimizes disclosure risk, and thus (in theory) represents a complete solution to the data masking problem
- Unfortunately, in practice
- Identifying f(S,X) may be very difficult
- Deriving f(X|S) may be very difficult
- Generating yi using f(X|S) may be very difficult
- In practice, it is unlikely that we can use the conditional distribution approach
Model Based Approaches
- Model based approaches for data masking essentially attempt to model the data set by using an assumed f*(S,X) for the joint distribution of (S and X), derive f*(X|S), and generate the masked values from this distribution
- The masked data (S, Y) will have the joint distribution f*(S,X) rather than the true joint distribution f(S,X)
- If the data is generated using f*(X|S) then the masking procedure minimizes disclosure risk since f(X|Y,S,A) = f(X|S,A)
- Assume that we have one non-confidential variable S and one confidential variable X
- Y = (a × S) + e, where e is a noise term
- We will always get better prediction if we attempt to predict X using S rather than Y (since Y is noisier than S)
- Since we have access to both S and Y, and since S would always provide more information about X than Y, an intelligent intruder will always prefer to use S to predict X than Y
- More importantly, since Y is a function of S and random noise, once S is used to predict X, including Y will not improve your predictive ability
Model Based Masking Methods
- Methods that we have developed and discuss below
- General additive data perturbation
- Copula based perturbation
- Data shuffling
- Other Methods
- PRAM
- Multiple imputation
- Skew t perturbation
General Additive Data Perturbation (GADP)
- A linear-model-based approach that can maintain the mean vector and covariance matrix of the masked data exactly the same as those of the original data
- The same as sufficiency based noise addition with proximity parameter = 0
- Ensures that the results of all traditional, parametric statistical analyses using the masked data are exactly the same as that using the original data
- Ensures that the release of the masked microdata results in no incremental disclosure
- By maintaining the mean vector and covariance matrix of the two data sets to be exactly the same, for any statistical analysis for which the mean vector and covariance matrix are sufficient statistics, we ensure that the parameter estimates from the masked data will be exactly the same as those from the original data (a sketch of this idea appears after this list)
- Unfortunately, the marginal distribution is altered significantly: in most situations, the marginal distribution of the masked variable bears little or no relationship to that of the original variable
- The masked data can also contain negative values even when the original variable had only positive values
- GADP is useful in a limited context. If the confidential variables do not exhibit significant deviations from normality, then GADP would represent a good solution to the problem
- In other cases, GADP represents a limited solution to the specific users who will use the data mainly for traditional statistical analysis
- We would like the masking procedure to provide some additional benefits (while still minimizing disclosure risk)
- Maintain the marginal distribution
- Maintain non-linear relationships
- To do this, we need to move beyond linear models
- Multiplicative models are not very useful since, in essence, they are just variations of the linear model
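A simplified sketch of the GADP idea (Python with NumPy, illustrative data): the masked value is drawn from a linear model of X given S, which preserves the mean vector and covariance matrix in expectation; the exact-preservation variants referenced above require a further adjustment step not shown here.

```python
# A simplified sketch of the GADP idea under an assumed (multivariate normal)
# linear model: the masked confidential value Y is drawn from the estimated
# conditional distribution of X given S.
import numpy as np

rng = np.random.default_rng(1)

# illustrative data: one non-confidential variable S, one confidential variable X
n = 2000
S = rng.normal(40, 8, size=(n, 1))
X = 1.5 * S + rng.normal(0, 4, size=(n, 1))

# fit the linear (normal) model of X given S
Z = np.hstack([np.ones((n, 1)), S])              # intercept + S
beta, *_ = np.linalg.lstsq(Z, X, rcond=None)     # regression coefficients
resid = X - Z @ beta
sigma = resid.std(ddof=Z.shape[1])               # residual standard deviation

# masked values: conditional mean given S plus freshly generated noise
Y = Z @ beta + rng.normal(0, sigma, size=(n, 1))

print(np.cov(np.hstack([S, X]), rowvar=False))
print(np.cov(np.hstack([S, Y]), rowvar=False))   # approximately equal covariance matrices
```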
Copula Based GADP (C-GADP)
- In statistics, copulas have traditionally been used to model the joint distribution of a set of variables with arbitrary marginal distributions and specified dependence characteristics
- C-GADP uses this machinery to maintain the marginal, possibly non-normal, distribution of the original attributes after masking and to preserve certain types of dependence between the attributes
Characteristics of the C-GADP
- C-GADP minimizes disclosure risk
- C-GADP provides the following information loss characteristics
- The marginal distribution of the confidential variables is maintained
- All monotonic relationships are preserved
- Rank order correlation
- Product moment correlation
- Non-monotonic relationships will be modified
- Consider a situation where we have a confidential variable X and a set of non-confidential variables S. If we assume that the multivariate (MV) copula is appropriate for modeling the data, then the perturbed data Y can be viewed as an independent realization from f(X|S). The marginal of Y is simply a different realization from the same marginal distribution as X. This being the case, we can reverse map the original values of X in place of the masked values Y. Now the values of Y are the same as those of X, but they have been “shuffled”.
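A minimal sketch of the reverse-mapping idea behind data shuffling, assuming Python with NumPy and SciPy and a simple normal-copula-style model of f(X|S); the released column contains only original values, re-ordered across rows:

```python
# A minimal sketch of data shuffling: generate perturbed values Y from a model of
# f(X | S), then reverse-map the ORIGINAL values of X onto the rank order of Y.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# illustrative data: standard-normal S, skewed confidential X related to S
n = 1000
S = rng.normal(0, 1, size=n)
X = np.exp(0.8 * S + rng.normal(0, 0.5, size=n))           # lognormal, monotonically tied to S

# model-based perturbed values on the normal-score scale
x_scores = stats.norm.ppf(stats.rankdata(X) / (n + 1))     # normal scores of X
rho = np.corrcoef(S, x_scores)[0, 1]
Y_scores = rho * S + rng.normal(0, np.sqrt(1 - rho**2), size=n)

# reverse mapping: place the sorted ORIGINAL X values onto the rank order of Y
ranks_of_Y = stats.rankdata(Y_scores, method="ordinal").astype(int) - 1
Y = np.sort(X)[ranks_of_Y]

print(sorted(X)[:3], sorted(Y)[:3])                         # identical marginal values
print(stats.spearmanr(S, X).correlation, stats.spearmanr(S, Y).correlation)  # similar rank correlation
```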
Characteristics of Data Shuffling
- Offers all the benefits of C-GADP
- Minimum disclosure risk
- Information loss
- Maintains the marginal distribution
- Maintain all monotonic relationships
- Additional benefits
- There is no “modification” of the values. The original values are used
- The marginal distribution of the masked data is exactly the same as the original data
- Implementation can be performed using only the ranks
Advantages and Limitations of Data Shuffling
- Data shuffling is a hybrid (perturbation and swapping), non-parametric (can be implemented only with rank information) technique for data masking that minimizes disclosure risk and offers the lowest level of information loss among existing methods of data masking
- Limitations: data shuffling will not maintain non-monotonic relationships and does not preserve tail dependence (the latter can be overcome by using a t-copula instead of the normal copula)
Some Important References
- Dalenius, T. – Methodology for statistical disclosure control
- Fuller, W. A. – Masking procedures for microdata disclosure limitation, Journal of Official Statistics
- Domingo-Ferrer, J. and Mateo-Sanz, J. M. – Practical data-oriented microaggregation for statistical disclosure control, IEEE Transactions on Knowledge and Data Engineering
- Rubin, D. B. – Discussion of statistical disclosure limitation, Journal of Official Statistics
- Moore, R. A. – Controlled data swapping for masking public use microdata sets, Research Report Series No. RR96/04, U.S. Census Bureau, Statistical Research Division
- Burridge, J. – Information preserving statistical obfuscation, Statistics and Computing