

Auditability is an important factor when modelling complex data sets. Users need to be able to tell when a specific piece of data changed, who changed it and for what reason. However, not every data change is equal to another, and while some changes need to be visible to the user, others need to replace older data in a transparent way. Read on to learn what it means to update and amend data, and how to implement these two operations in your database.
Consider the following data set of employees:
| Last name | First name | Phone number |
|---|---|---|
| Adams | Agatha | 123 456 789 |
| Smith | John | 243 567 890 |
There are at least two reasons for making changes to the table. First of all, the data may be wrong. Take a look at John Smith's phone number: someone mistyped it, putting 243 567 890 instead of 234 567 890. At any point in time, we need to be able to get the data corrected. Additionally, all the queries reaching back in time need to ignore incorrect data and only return the corrected bits.
Secondly, the data itself may change. Suppose that Agatha Adams has changed her phone carrier and as a result her phone number has changed to 987 654 321. We need to update the data, but retain the ability to see what her phone number was before the change.
To make things more interesting, both scenarios may happen at the same time. We could get John Smith's phone number corrected one day, only to learn a few days later that he changed his phone carrier and number. And we may yet again mistype his new number and have to get it corrected a week later. And yet we want the system to remain consistent, and always return the correct piece of data for a given time frame.
Let's deal with the data update scenario, as it is the most common one. A widely used approach is to add two fields to the database schema, which define the validity time frame of this record. These two fields are commonly named VALID_FROM and VALID_TO. The first one is non-nullable and contains the timestamp when the record's validity is to begin. The second one is nullable and if it is null, the record is still valid data. If non-null, the record has been invalidated and the timestamp recorded in this field tells us when the record's validity ends, exclusive.
With such an approach, the above data table becomes:
| Last name | First name | Phone number | Valid from | Valid to |
|---|---|---|---|---|
| Adams | Agatha | 123 456 789 | 2023-06-01 | - |
| Smith | John | 243 567 890 | 2023-05-01 | - |
Note that for simplicity, I have decided to use only dates, not timestamps, for the valid from and valid to fields. This not only makes the examples more concise, but also helps to highlight the way these fields should be updated and queried later on. Also, there are some scenarios in which data validity resolution of days may be absolutely adequate. However, remember that dates are not as simple as they may appear initially, as time zone issues affect them in very nasty ways.
Now, suppose we want to invalidate Agatha's old phone number and put 987 654 321 as the new one, starting from June 20th. In order to do so, we need to set the valid to field of the invalidated record to 2023-06-10, and add a new record, with the valid from field set to 2023-06-10 too, and the valid to field set to null.
UPDATE employee
SET VALID_TO="2023-06-10"
WHERE LAST_NAME="Adams" AND FIRST_NAME="Agatha" AND VALID_TO IS NULL;
INSERT INTO employee
VALUES ("Adams", "Agatha", "987 654 321", "2023-06-10", NULL);
| Last name | First name | Phone number | Valid from | Valid to |
|---|---|---|---|---|
| Adams | Agatha | 123 456 789 | 2023-06-01 | 2023-06-10 |
| Adams | Agatha | 987 654 321 | 2023-06-10 | - |
| Smith | John | 243 567 890 | 2023-05-01 | - |
It is important to keep the same timestamp in both valid to field of the invalidated record and valid from field of the new one. This way we eliminate any potential time window when no record would be considered valid. Also, the two statements need to be executed within a transaction, so that no other connection is able to see an inconsistent state.
Now, in order to query for the data valid on a specific day, we need to limit the results to the one record that was already valid on that day, and either is still valid, or at least did not stop being valid until some later day:
SELECT * FROM employee WHERE VALID_FROM<="2023-06-09" AND (VALID_TO>"2023-06-09" OR VALID_TO IS NULL);
| Last name | First name | Phone number | Valid from | Valid to |
|---|---|---|---|---|
| Adams | Agatha | 123 456 789 | 2023-06-01 | 2023-06-10 |
You may have noticed that the examples above use natural data for referencing records. However, the common practice is to use some kind of a key to uniquely identify objects. In the case of versioned records, however, objects may have multiple instances and thus the identifier is no longer unique. Fortunately, we hardly ever need to reference individual instances of objects, so we can get away with having duplicate identifiers:
| Person ID | Last name | First name | Phone number | Valid from | Valid to |
|---|---|---|---|---|---|
| 1 | Adams | Agatha | 123 456 789 | 2023-06-01 | 2023-06-10 |
| 1 | Adams | Agatha | 987 654 321 | 2023-06-10 | - |
| 2 | Smith | John | 243 567 890 | 2023-05-01 | - |
In such a case, the query that adds an updated version of an employee record gets simpler:
UPDATE employee SET VALID_TO="2023-06-10" WHERE PERSON_ID=1 AND VALID_TO IS NULL; INSERT INTO employee VALUES (1, "Adams", "Agatha", "987 654 321", "2023-06-10", NULL);
If you ever needed to reference specific records, you could always add a second, unique technical identifier (such as UUID). This is rarely necessary, though.
The nice thing about the versioning scheme presented above is that it makes it trivial to delete items from the database without actually removing them. This is called soft delete and lets you mark objects as deleted, but retain the ability to view their historical state. This is crucial to auditability.
Normally, you would need a separate field in the database. However, with the versioning scheme applied, all it takes to make an object disappear is to set the valid to field to the moment of deletion, without inserting a new one. As there will be no currently valid instance of the object, the object will cease to exist from the logical standpoint—and yet all its prior instances will still be there and will respond to historical queries.
UPDATE employee SET VALID_TO="2023-08-01" WHERE PERSON_ID=1 AND VALID_TO IS NULL;
Also, if you ever need to reinstantiate the object, adding a new instance with a completely new valid from timestamp will suffice. What is great about this approach is that the discontinuity of the object's existence will be recorded in the database, and historical queries targeting this intermittent nonexistence will return nothing.
Data corrections are similar to data updates from a technical standpoint. Their implementation can be much simpler, though. As we will never need to consult any specific wrong, uncorrected version of the object, we do not need to store the “correctness period” separately for every record. It is enough to introduce one new field, named CORRECTED_ON for example. If it is null, the record is absolutely correct and we can use its content for queries. If it contains a timestamp, it has been amended by a different record and needs not to be consulted when querying.
Now imagine we need to get John Smith's phone number corrected:
| Person ID | Last name | First name | Phone number | Corrected on |
|---|---|---|---|---|
| 1 | Adams | Agatha | 123 456 789 | - |
| 2 | Smith | John | 243 567 890 | - |
We can do it by setting the Corrected on field to the timestamp of actual data correction (for the most part, it should be the current timestamp), and then inserting a new record with the invalid data corrected:
UPDATE employee SET CORRECTED_ON="2023-06-01" WHERE PERSON_ID=2 AND CORRECTED_ON IS NULL; INSERT INTO employee VALUES (2, "Smith", "John", "234 567 890", NULL);
| Person ID | Last name | First name | Phone number | Corrected on |
|---|---|---|---|---|
| 1 | Adams | Agatha | 123 456 789 | - |
| 2 | Smith | John | 243 567 890 | 2023-06-01 |
| 2 | Smith | John | 234 567 890 | - |
Once again, the same object may appear multiple times. Thus, while we can use an identifier to specify the object to be corrected, we have to accept that the same identifier will appear more than once.
Querying for the correct data is actually quite simple. All it takes is to add a condition that only returns the records which contain no Corrected on timestamp:
SELECT * FROM employee WHERE CORRECTED_ON IS NULL;
While the concepts of record versioning and amending seem simple, merging them together makes things a bit more complicated.
Let's start with the database schema. We need to include both the Valid from and Valid to fields, as well as Corrected on:
| Person ID | Last name | First name | Phone number | Valid from | Valid to | Corrected on |
|---|---|---|---|---|---|---|
| 1 | Adams | Agatha | 123 456 789 | 2023-06-01 | - | - |
| 2 | Smith | John | 243 567 890 | 2023-05-01 | - | - |
Fortunately, there is not much we need to do about the data update query. Even if the most recent version of the object contains corrections, we want to mark them all as replaced with a completely new occurrence. Thus, we only need to update the query by including the Corrected on field:
UPDATE employee SET VALID_TO="2023-06-10" WHERE PERSON_ID=1 AND VALID_TO IS NULL; INSERT INTO employee VALUES (1, "Adams", "Agatha", "987 654 321", "2023-06-10", NULL, NULL);
When querying for data, we need to filter out all records that have been amended and keep only the ones whose Corrected on field is null. The query will become only a bit more complex:
SELECT * FROM employee WHERE VALID_FROM<="2023-06-09" AND (VALID_TO>"2023-06-09" OR VALID_TO IS NULL) AND CORRECTED_ON IS NULL;
The data correction scenario gets much more difficult, though. Data amendment may apply to historical values, and we need to precisely specify the version of the object that needs to be corrected. However, we may rely on having the original record, previously retrieved from the database. As a result, we may directly use the Valid from and Valid to values when updating and adding records.
UPDATE employee SET CORRECTED_ON="2023-06-01" WHERE PERSON_ID=2 AND VALID_FROM="2023-05-01" AND CORRECTED_ON IS NULL; INSERT INTO employee VALUES (2, "Smith", "John", "234 567 890", "2023-05-01" NULL, NULL);
Actually, the above queries could get a bit simpler. By configuring the Valid to and Corrected on fields so that they default to null, you could omit them in some of the INSERT statements.
In most applications, we aim for retrieving the most recent version of information. So, while the option of accessing a specific version of the object, or even fetching the complete history of changes, is definitely nice, for the most part we will be dealing with the latest update to every object. Fortunately, it is fairly easy to get all the records in their latest appearance:
SELECT * FROM employee WHERE VALID_TO IS NULL AND CORRECTED_ON IS NULL;
One thing that gets really messy once you implement versionable, correctable database records is referencing. You might have deduced it from the fact that there is no unique primary key in our schema. There is simply no way to directly reference a specific record using foreign keys.
But even if it was possible, it would not work too well. Two different records stored in two different tables may have different lifespans. Imagine the following tables:
| Person ID | First name | Last name | Valid from | Valid to |
|---|---|---|---|---|
| 1 | Adams | Agatha | 2023-06-01 | 2023-07-01 |
| 1 | Carruthers | Agatha | 2023-07-01 | - |
| Person ID | Phone number | Valid from | Valid to |
|---|---|---|---|
| 1 | 123 456 789 | 2023-06-01 | 2023-07-15 |
| 1 | 345 678 123 | 2023-07-15 | - |
Agatha Adams changed her last name on July 1st. But she also decided to change her phone carrier two weeks later. There is no way you can link her phone numbers, with their own validity periods, with her personal record, with different validity periods. If you tried, you would inevitably introduce artificial boundaries in the record lifecycles, and the scheme would fall apart as soon as a dozen of different tables would need to link to each other.
You can solve this problem in two ways:
Of course, the above guide does not touch all the topics related to data update and modification and record versioning. In many cases, it is not enough to only keep track of changes made to the data. It may as well be necessary to track information about the user who initiated the change and the reason the change was introduced in the first place.
However, these topics exceed the scope of this article. Additionally, the actual implementation of these mechanisms will depend on the system-specific requirements. Feel free to pursue the idea with your own implementation, and put a comment below if you would like to see a separate piece on this topic on my site.
You can experiment with the ideas presented in this guide using your own instance of a database. However, if you do not have one, you can always try to use one online. On SQL Fiddle you can create your own set of tables, fill them with data, and then run queries to see how they work. The site supports multiple SQL dialects, including MySQL one used in above examples. The schemas are also persistent, which makes it so much easier to get back to work a day later.
In order to define the employees table, you can use the following piece of code:
CREATE TABLE employee ( PERSON_ID INTEGER NOT NULL, LAST_NAME VARCHAR(256), FIRST_NAME VARCHAR(256), PHONE_NO VARCHAR(100), VALID_FROM DATE NOT NULL, VALID_TO DATE NULL DEFAULT NULL, CORRECTED_ON DATE NULL DEFAULT NULL);
In order to pre-fill the table with initial data, just run the following statements:
INSERT INTO employee VALUES (1, "Adams", "Agatha", "123 456 789", "2023-06-01", NULL, NULL); INSERT INTO employee VALUES (2, "Smith", "John", "243 567 890", "2023-06-08", NULL, NULL);
And finally, to create the instances of objects being the result of updating and amending data, try the following statements:
UPDATE employee SET VALID_TO="2023-06-20" WHERE PERSON_ID=1 AND VALID_TO IS NULL; INSERT INTO employee VALUES (1, "Adams", "Agatha", "987 654 3217", "2023-06-20", NULL, NULL); UPDATE employee SET CORRECTED_ON="2023-06-01" WHERE PERSON_ID=2 AND CORRECTED_ON IS NULL; INSERT INTO employee VALUES (2, "Smith", "John", "234 567 890", "2023-06-08", NULL, NULL);
There is certainly a group of people smiling when reading this article. They are the practitioners of the Event Sourcing technique, and they do not need to create such complex database schemas and queries to achieve complete auditability. Event sourcing inherently provides such capabilities, as events accumulated in the event log represent every change and correction made in the past.
But not every application needs event sourcing. This technique is hard to get right and needs skilled hands to properly implement it. And yet, if sensible performance levels are to be achieved, some aggregates may need relational snapshots stored in the database. And if these aggregates need to be queried for past state, the snapshots may need to follow the pattern presented above.
In many cases, a good implementation of domain driven design, with the database persisting aggregates, entities and value objects directly in a relational form, is good enough. And if your domain requires you to store information about all past changes to data, I hope you find my guide helpful.