Dev Ops

Refreshing Data

Most applications require a database to support the code. Developers should run their code against production like database to fully test their code.

Ronan Killeen

Sep 4, 2018 • 6 min read

Most applications require some manner of database to support the code. Developers should run their code against instances that are as close to production as possible to fully verify their code.

If it is difficult to get such a copy of production, developers will make changes to the code and be unable to verify that the behaviour is correct. For example, a query that works in a development database, may fail in production due to a schema change that has not been propagated.

Another issue could be that the schema has been changed in production but the developer does not have an up to date version and is effectively testing against the wrong database structure.

Data on development environments is often changed so that personal information, such as names and addresses, get blanked or changed to meaningless data; a knock on effect can be that some parts of the application become to difficult to test.

My company has a suite of databases from multiple vendors. In order to support UAT, we regularly need to take data from production to ensure that the users can interact with real test cases.

The procedure for doing such refreshes is a mixture of automation and manual steps, which include taking backups, moving backups around, applying backups to servers and running scripts to anonymise the data, delete rows to make it a more manageable size, or other configuration scripts against the database.

construction-worker-concrete-hummer-vibrator-38600

Whether the step is manual or not depends on the database type and also the environment that the backup is being applied to.

This leads to delays and issues, some of which are not recognised until the users begin their testing. The handover between the developers requesting the refreshes and the operations team that do them is handled via our ticketing system, JIRA. As it stands, we allow two days to get the databases refreshed and this is too slow.

Another issue is that the databases backups sometimes do not align with each other, for example, a weekly backup from a non essential database compared to a nightly backup of a critical one. This can lead to issues in the environment that may or may not arise – it depends on what is being tested in that phase of UAT.

We don’t have a dedicated team to manage environments – the developers take turns to set-up and manage UAT. Perhaps a dedicated Dev Ops team, as described in this article, could design and build one of the following solutions to remove the need for a dedicated team to manage environments. This team would have to include an operations specialist

Store backups in a repository

Rather than backups being hidden from development teams, store them in a internally public repository that shows the source and time of the backup. This allows the person setting up the environment to be very specific about the backup they want restored.

An extension of this is to take and show backups from UAT and other non-production environments. Developers may want to take a copy of UAT into a development database to try out a patch for a bug. Having a menu to select from will help.

The repository should also show the state of anonymisation of the backup, where fully anonymised backups might be available to anyone, and the actual production backup would be restricted or have to go through the annoymisation process before being restored. It should also show whether the backup is a cut down version, reduced in size to speed up the restore.

Self selection of refreshes

Why not create a web based form that can be filled out to select the refreshes? Give me such and such a source backup and refresh to this environment. Take the operator out of the equation.

The operators follow a list of steps to apply the backup to an environment. Any manual process can be automated. By removing the operator from the process we can remove the handover to and back from them.

Our process involves raising a ticket in JIRA. How long the refresh takes will directly depend on the size of the queue of tickets. Automation takes away this restriction.

Some of our backups are slower than others as the database size is larger or requires more processing or transferring before it can be refreshed to a test database. The time spent to do this can tie up an operator, as they are not really free to do other work during the time of the refresh. Getting this time back is another win for automation.

Authorisation

We should build in some protections on environments so that authorisation at a particular role or level is required. Another approach may be to require an approval for a refresh request before the system will begin.

The use case here might be the UAT environment. The quality assurance manager will have a better picture of what testing is ongoing than individual teams or developers. A database refresh can ruin hours of painstaking test case set-up.

So some environments should be considered protected. A developer could place the request to refresh UAT into the self-service system, triggering an alert to the person responsible. The process would not begin until an authorised person approves the request.

Personal Developer Database

A key to effectively developing software is to get feedback as soon as possible after making a change. Developers run unit tests on their code immediately, static analysis can be set-up to run after a change is committed to source control, etc.

If the code has to be deployed manually to an environment, or queued up behind other changes waiting to be deployed to continuous integration, then a gap can build up between when the change was made and the problem discovered.

In order to allow developers to get immediate feedback on their changes, they may want to run the code against a development database. Sometimes this can involve deploying multiple services and coordinating several databases.

It would be great if a developer could set-up the entire ecosystem within a virtual network, right on their own workstation. Having infrastructure code to create and manage the servers and databases deployable from source is a pre-requisite, but this could be extended to include getting refreshed data from the backups store.

Wouldn’t it be awesome to be able to select the version of an anonymised database and pull it down to a virtual environment running on your local machine so that you can run all the tests before checking in code?

Cut down data set

Most of the time we do not need the complete database. It would be easier and faster to get a cut-down version with just enough user data to allow either automated or exploratory testing.

pexels-photo-203553

We could work with our operators to create such cut-down backups with just enough data to keep the system functioning, but much smaller and thus faster to apply refreshes to development and test databases.

We should design this system so that new test cases can be requested and automatically added to the next refresh, such as adding a new table, or queries to identify the rows that make up cohernt test cases.

To add a new test scenario, the developer would commit a change to the script that builds the cut down set into source control and the next time a cut down backup is scheduled to run, it will automatically pick up the additional test case.

Conclusion

Developing software against old versions of a database can be problematic. Getting copies of production databases into development and test environments can be time consuming and needs co-ordination between development teams and operators. Restricting datasets to avoid propagating sensitive data in non-secure environments in a necessity.

As developers, we should work with our operators to make the database refresh process as simple and self-service as possible, by automating all manual steps, providing authorisation where necessary and only involving operations when the process hits an issue.

Actions

Talk to someone in operations and discuss whether it is possible to create some sort of a view of the available backups and expose meta data: backup date, source, size and anonymisation state.
Build a web page that shows this to developers and QA.
Automate the process of refreshing a database.
Create a self-service page so that a backup and a destination can be selected.
Trigger the automatic refresh from the web page.
Apply authorisation when implementing for UAT and other sensitive environments
Set up a schedule to regularly refresh the Continuous Integration environment.