Testing in your development environments using data from your production systems gives you accurate feedback on the latest additions to your application, but how do you avoid breaking Data Security rules?
A problem with incremental development and testing at scale
You’re developing a web app, and you’ve done the right things. You’ve worked hard to gain confidence that your code will work at production scale, generating large, realistic data volumes to test your code against. You test and integrate often, and you push your code to production often, because you know that the load and the data produced by users at scale are the real test of your system.
But then you hit the wall: your generated data isn’t good enough. It’s neither accurate nor wide-ranging enough, because users don’t always use the software the way you expected. Generating tonnes of accurate data against a changing feature set and schema is costly and a hell of a pain. Frustratingly, there is a database full of real data sitting just over there in production.
You want to use your production data to test your software but you can’t – either the data is just too sensitive to move to less secure environments, or you are working offshore or in the cloud and your Data Protection rules don’t allow you and your team bulk access to production data.
Basic anonymisation processes can be risky
Production data is often a secure store of sensitive personal data; Data Protection officers are keen that it stays that way. If we want that data, we can write a simple script to grab it and anonymise the private bits before it moves.
But in a Continuous Delivery environment, the data schemas are under constant change, all part of working incrementally and iteratively. As soon as we make feature changes that require changes to the data schemas, we risk exposing important data.
Using a Blacklist Anonymisation process puts us one schema change away from a data security leak. We need something more watertight.
Blacklist Anonymisation
Obscuring sensitive data in a schema with safe boilerplate copy, through a list of fields to be obscured.
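To make the risk concrete, here is a minimal sketch of a blacklist approach. The field names and the `REDACTED` placeholder are hypothetical, not from any particular tool:

```python
# Sketch of Blacklist Anonymisation: only fields named in the
# blacklist are obscured; anything else passes through untouched.
# Field names here are illustrative examples.

BLACKLIST = {"name", "email", "phone"}

def blacklist_anonymise(record):
    """Replace blacklisted fields with boilerplate; copy the rest as-is."""
    return {
        field: "REDACTED" if field in BLACKLIST else value
        for field, value in record.items()
    }

record = {"id": 42, "name": "Jane Doe", "email": "jane@example.com"}
safe = blacklist_anonymise(record)
# A later schema change that adds, say, a "mobile" field would leak
# straight through, because the blacklist does not know about it.
```

The weakness is visible in the last comment: the safety of the data depends on the blacklist being updated in lockstep with every schema change.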
A safer option
Using a more mature anonymisation process means our data is more likely to remain safe as the schema develops. Rather than relying on constant vigilance to maintain the list of fields to anonymise, everything gets obscured by default. We specify which fields are fine to let through unobscured, usually the structural data that sets up the relationships between your data items, plus some safe data; the rest is removed or obscured. This way a schema change that renames a field or restructures a collection of data is likely only to degrade your anonymisation; it won’t leak personal details out of the secure store.
Obscuring data with boilerplate such as ‘Lorem ipsum’ for copy, and placeholder or random values for other data types, fills the gaps in a schema safely whilst retaining the feel of the original data. Tricksy fields may be dropped or nulled.
This gives a large, realistic data set for testing. You can gather important data points about data-size expectations, and you should be able to gain further confidence in your predictions of software performance.
Whitelist Anonymisation
Obscuring all data in a schema with safe boilerplate copy by default, allowing very specific fields to pass through uncorrupted.
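A whitelist version of the earlier sketch inverts the default: everything is obscured unless explicitly let through. Again, the field names and placeholder values are illustrative assumptions:

```python
import random

# Sketch of Whitelist Anonymisation: every field is obscured by default,
# with type-appropriate boilerplate; only whitelisted structural or safe
# fields pass through. Field names are illustrative.

WHITELIST = {"id", "parent_id", "created_at"}  # structural / safe fields

def obscure(value):
    """Replace a value with safe boilerplate of roughly the same type."""
    if isinstance(value, bool):
        return random.choice([True, False])
    if isinstance(value, (int, float)):
        return random.randint(0, 1000)
    if isinstance(value, str):
        return "Lorem ipsum"
    return None  # tricksy types get nulled

def whitelist_anonymise(record):
    return {
        field: value if field in WHITELIST else obscure(value)
        for field, value in record.items()
    }
```

Note how a renamed or newly added field simply falls through to `obscure`: the anonymisation degrades (the field is no longer realistic), but nothing sensitive escapes.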
Making the data more useful
With Whitelist Anonymisation we have data safety and data-at-scale, but the data isn’t all that useful yet; a richer dataset would help with exploratory testing, as well as more in-depth, realistic performance measurements. We can apply special rules to important fields, to ensure that they remain obscured but provide more useful values. For example, geo-location values can be driven from a known set, or locations can be offset by a random safe amount. Postcodes and zip codes can be tweaked, shortened, or relocated; ages varied within ranges; and emails directed to test accounts.
These data adjustments are put in place whilst keeping to the data security aims: usually to ensure there is not enough data remaining to allow a person to be identified. Working in partnership with a Data Protection expert to ensure correct compliance is advised.
Graylist Anonymisation
Adjusting specific fields in non-reversible ways whilst anonymising a schema, to leave a wide range of domain specific data available for testing without revealing private information.
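Extending the whitelist sketch, a graylist adds per-field rules that keep values realistic without being reversible. Every rule and field name below is an illustrative assumption, not a prescribed scheme:

```python
import random

# Sketch of Graylist Anonymisation: whitelist-style safety plus
# per-field, non-reversible transforms that keep values realistic.
# All rules and field names are illustrative examples.

def offset_location(latlon):
    """Shift a (lat, lon) pair by a random, non-reversible offset."""
    lat, lon = latlon
    return (lat + random.uniform(-0.05, 0.05),
            lon + random.uniform(-0.05, 0.05))

GRAYLIST_RULES = {
    "postcode": lambda p: p.split(" ")[0],        # keep outward code only
    "age":      lambda a: (a // 10) * 10,         # bucket into decades
    "email":    lambda e: "test+%d@example.com" % random.randint(1, 9999),
    "location": offset_location,
}
WHITELIST = {"id"}  # structural data passes through untouched

def graylist_anonymise(record):
    out = {}
    for field, value in record.items():
        if field in GRAYLIST_RULES:
            out[field] = GRAYLIST_RULES[field](value)
        elif field in WHITELIST:
            out[field] = value
        else:
            # everything unrecognised falls back to safe boilerplate
            out[field] = "Lorem ipsum" if isinstance(value, str) else None
    return out
```

The defaults still obscure anything the rules don’t know about, so the whitelist safety property is preserved; the rules only add realism for the fields a Data Protection expert has agreed are safe to transform.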