RSS

Looking into past data: the difference between updates and corrections

Liczba odsłon: 20

Audita­bi­li­ty is an im­por­tant fac­tor when mo­del­ling com­plex da­ta sets. Users need to be able to tell when a spe­ci­fic pie­ce of da­ta changed, who changed it and for what rea­son. However, not every da­ta chan­ge is equal to ano­ther, and while some changes need to be vi­sib­le to the user, others need to re­place older da­ta in a trans­parent way. Read on to learn what it means to up­date and amend da­ta, and how to im­ple­ment these two ope­ra­tions in your data­base.

Photo by Mr Cup / Fabien Barral on Unsplash

Consider the follo­wing da­ta set of em­ploy­ees:

Last nameFirst namePhone num­ber
AdamsAgatha123 456 789
SmithJohn243 567 890

There are at least two rea­sons for making changes to the table. First of all, the da­ta may be wrong. Take a look at John Smith's phone num­ber: some­one mis­ty­ped it, putting 243 567 890 in­stead of 234 567 890. At any point in time, we need to be able to get the da­ta cor­rec­ted. Additio­nal­ly, all the queries reaching back in time need to ignore in­cor­rect da­ta and on­ly return the cor­rec­ted bits.

Secondly, the da­ta itself may chan­ge. Suppose that Agatha Adams has changed her phone car­rier and as a result her phone num­ber has changed to 987 654 321. We need to up­date the da­ta, but re­tain the abi­lity to see what her phone num­ber was be­fore the chan­ge.

To make things more in­te­res­ting, both sce­narios may happen at the sa­me time. We could get John Smith's phone num­ber cor­rec­ted one day, on­ly to learn a few days later that he changed his phone car­rier and num­ber. And we may yet again mis­ty­pe his new num­ber and have to get it cor­rec­ted a week later. And yet we want the sys­tem to re­main con­sis­tent, and al­ways return the correct pie­ce of da­ta for a given time frame.

Updates

Let's deal with the da­ta up­date sce­nario, as it is the most common one. A wi­dely used appro­ach is to add two fields to the data­base sche­ma, which de­fine the vali­dity time frame of this re­cord. These two fields are com­mon­ly named VALID_FROM and VALID_TO. The first one is non-nul­la­ble and con­tains the time­stamp when the re­cord's vali­dity is to begin. The se­cond one is nul­la­ble and if it is null, the re­cord is still va­lid da­ta. If non-null, the re­cord has been in­va­li­da­ted and the time­stamp re­corded in this field tells us when the re­cord's vali­dity ends, ex­clu­sive.

With such an appro­ach, the above da­ta table be­comes:

Last nameFirst namePhone num­berValid fromValid to
AdamsAgatha123 456 7892023-06-01-
SmithJohn243 567 8902023-05-01-

Note that for sim­pli­city, I have de­ci­ded to use on­ly dates, not ti­me­stamps, for the va­lid from and va­lid to fields. This not on­ly makes the examples more con­cise, but also helps to high­light the way these fields should be up­da­ted and que­ried later on. Also, there are some sce­narios in which da­ta vali­dity re­so­lu­tion of days may be ab­so­lu­te­ly ade­quate. However, re­mem­ber that dates are not as sim­ple as they may appear ini­tially, as time zone issues affect them in very nas­ty ways.

Now, sup­po­se we want to in­va­li­da­te Agatha's old phone num­ber and put 987 654 321 as the new one, star­ting from June 20th. In or­der to do so, we need to set the va­lid to field of the in­va­li­da­ted re­cord to 2023-06-10, and add a new re­cord, with the va­lid from field set to 2023-06-10 too, and the va­lid to field set to null.

UPDATE employee
  SET VALID_TO="2023-06-10"
  WHERE LAST_NAME="Adams" AND FIRST_NAME="Agatha" AND VALID_TO IS NULL;
INSERT INTO employee
  VALUES ("Adams", "Agatha", "987 654 321", "2023-06-10", NULL);
Last nameFirst namePhone num­berValid fromValid to
AdamsAgatha123 456 7892023-06-012023-06-10
AdamsAgatha987 654 3212023-06-10-
SmithJohn243 567 8902023-05-01-

It is im­por­tant to keep the sa­me time­stamp in both va­lid to field of the in­va­li­da­ted re­cord and va­lid from field of the new one. This way we eli­mi­na­te any po­ten­tial time win­dow when no re­cord would be con­si­de­red va­lid. Also, the two sta­te­ments need to be exe­cu­ted within a tran­sac­tion, so that no other con­nec­tion is able to see an in­con­sis­tent state.

Now, in or­der to query for the da­ta va­lid on a spe­ci­fic day, we need to li­mit the re­sults to the one re­cord that was already va­lid on that day, and either is still va­lid, or at least did not stop being va­lid until some later day:

SELECT * FROM employee
  WHERE VALID_FROM<="2023-06-09"
  AND (VALID_TO>"2023-06-09" OR VALID_TO IS NULL);
Last nameFirst namePhone num­berValid fromValid to
AdamsAgatha123 456 7892023-06-012023-06-10

You may have no­ticed that the examples above use na­tu­ral da­ta for re­fe­ren­cing re­cords. However, the common prac­tice is to use some kind of a key to uni­que­ly iden­ti­fy ob­jects. In the case of ver­sio­ned re­cords, how­ever, ob­jects may have mul­tiple in­stances and thus the iden­tifier is no lon­ger uni­que. Fortunately, we hard­ly ever need to re­fe­rence in­di­vi­dual in­stances of ob­jects, so we can get away with having dup­li­ca­te iden­ti­fiers:

Person IDLast nameFirst namePhone num­berValid fromValid to
1AdamsAgatha123 456 7892023-06-012023-06-10
1AdamsAgatha987 654 3212023-06-10-
2SmithJohn243 567 8902023-05-01-

In such a case, the query that adds an up­da­ted ver­sion of an em­ployee re­cord gets simp­ler:

UPDATE employee
  SET VALID_TO="2023-06-10"
  WHERE PERSON_ID=1 AND VALID_TO IS NULL;
INSERT INTO employee
  VALUES (1, "Adams", "Agatha", "987 654 321", "2023-06-10", NULL);

If you ever needed to re­fe­rence spe­ci­fic re­cords, you could al­ways add a se­cond, uni­que tech­ni­cal iden­tifier (such as UUID). This is rare­ly ne­ce­ssary, though.

Deletions

The nice thing about the ver­sio­ning scheme pre­sen­ted above is that it makes it trivial to de­lete items from the data­base with­out actu­ally re­moving them. This is called soft de­lete and lets you mark ob­jects as de­le­ted, but re­tain the abi­lity to view their his­to­ri­cal state. This is cru­cial to au­di­ta­bi­li­ty.

Normally, you would need a sepa­rate field in the data­base. However, with the ver­sio­ning scheme applied, all it takes to make an object dis­appear is to set the va­lid to field to the mo­ment of dele­tion, with­out inserting a new one. As there will be no curren­tly va­lid in­stan­ce of the object, the object will ce­ase to exist from the lo­gi­cal stand­point—and yet all its prior in­stances will still be there and will re­spond to his­to­ri­cal queries.

UPDATE employee
  SET VALID_TO="2023-08-01"
  WHERE PERSON_ID=1 AND VALID_TO IS NULL;

Also, if you ever need to re­in­stan­tia­te the object, adding a new in­stan­ce with a com­ple­tely new va­lid from time­stamp will suf­fi­ce. What is great about this appro­ach is that the dis­con­ti­nu­ity of the object's exis­tence will be re­corded in the data­base, and his­to­ri­cal queries tar­ge­ting this in­ter­mit­tent non­exis­ten­ce will return nothing.

Corrections

Data corrections are simi­lar to da­ta up­dates from a tech­ni­cal stand­point. Their im­ple­men­tation can be much simp­ler, though. As we will never need to con­sult any spe­ci­fic wrong, un­cor­rec­ted ver­sion of the object, we do not need to store the “cor­rect­ness pe­riod” se­pa­ra­tely for every re­cord. It is enough to intro­duce one new field, named CORRECTED_ON for exam­ple. If it is null, the re­cord is ab­so­lu­te­ly correct and we can use its con­tent for queries. If it con­tains a time­stamp, it has been amen­ded by a diffe­rent re­cord and needs not to be con­sulted when que­ry­ing.

Now ima­gine we need to get John Smith's phone num­ber cor­rec­ted:

Person IDLast nameFirst namePhone num­berCorrected on
1AdamsAgatha123 456 789-
2SmithJohn243 567 890-

We can do it by set­ting the Corrected on field to the time­stamp of actual da­ta cor­rec­tion (for the most part, it should be the current time­stamp), and then inserting a new re­cord with the in­va­lid da­ta cor­rec­ted:

UPDATE employee
  SET CORRECTED_ON="2023-06-01"
  WHERE PERSON_ID=2 AND CORRECTED_ON IS NULL;
INSERT INTO employee
  VALUES (2, "Smith", "John", "234 567 890", NULL);
Person IDLast nameFirst namePhone num­berCorrected on
1AdamsAgatha123 456 789-
2SmithJohn243 567 8902023-06-01
2SmithJohn234 567 890-

Once again, the sa­me object may appear mul­tiple times. Thus, while we can use an iden­tifier to speci­fy the object to be cor­rec­ted, we have to accept that the sa­me iden­tifier will appear more than once.

Querying for the correct da­ta is actu­ally quite sim­ple. All it takes is to add a con­di­tion that on­ly re­turns the re­cords which con­tain no Corrected on time­stamp:

SELECT * FROM employee
  WHERE CORRECTED_ON IS NULL;

Putting it all to­gether

While the con­cepts of re­cord ver­sio­ning and amen­ding seem sim­ple, mer­ging them to­gether makes things a bit more com­pli­ca­ted.

Let's start with the data­base sche­ma. We need to in­clu­de both the Valid from and Valid to fields, as well as Corrected on:

Person IDLast nameFirst namePhone num­berValid fromValid toCorrected on
1AdamsAgatha123 456 7892023-06-01--
2SmithJohn243 567 8902023-05-01--

Fortunately, there is not much we need to do about the da­ta up­date query. Even if the most re­cent ver­sion of the object con­tains corrections, we want to mark them all as re­placed with a com­ple­tely new oc­cur­ren­ce. Thus, we on­ly need to up­date the query by in­clu­ding the Corrected on field:

UPDATE employee
  SET VALID_TO="2023-06-10"
  WHERE PERSON_ID=1 AND VALID_TO IS NULL;
INSERT INTO employee
  VALUES (1, "Adams", "Agatha", "987 654 321", "2023-06-10", NULL, NULL);

When que­ry­ing for da­ta, we need to fil­ter out all re­cords that have been amen­ded and keep on­ly the ones whose Corrected on field is null. The query will be­come on­ly a bit more com­plex:

SELECT * FROM employee
  WHERE VALID_FROM<="2023-06-09"
  AND (VALID_TO>"2023-06-09" OR VALID_TO IS NULL)
  AND CORRECTED_ON IS NULL;

The da­ta cor­rec­tion sce­nario gets much more diffi­cult, though. Data amend­ment may apply to his­to­ri­cal va­lues, and we need to pre­ci­se­ly speci­fy the ver­sion of the object that needs to be cor­rec­ted. However, we may rely on having the ori­gi­nal re­cord, pre­vio­usly re­trie­ved from the data­base. As a result, we may di­rect­ly use the Valid from and Valid to va­lues when up­dating and adding re­cords.

UPDATE employee
  SET CORRECTED_ON="2023-06-01"
  WHERE PERSON_ID=2 AND VALID_FROM="2023-05-01" AND CORRECTED_ON IS NULL;
INSERT INTO employee
  VALUES (2, "Smith", "John", "234 567 890", "2023-05-01" NULL, NULL);

Actually, the above queries could get a bit simp­ler. By con­fi­guring the Valid to and Corrected on fields so that they de­fault to null, you could omit them in some of the INSERT sta­te­ments.

In most appli­cations, we aim for retrie­ving the most re­cent ver­sion of in­for­ma­tion. So, while the option of accessing a spe­ci­fic ver­sion of the object, or even fetching the com­ple­te his­to­ry of changes, is defini­tely nice, for the most part we will be dealing with the la­test up­date to every object. Fortunately, it is fair­ly easy to get all the re­cords in their la­test appea­ran­ce:

SELECT * FROM employee
  WHERE VALID_TO IS NULL
  AND CORRECTED_ON IS NULL;

References

One thing that gets really mes­sy once you im­ple­ment ver­sio­nab­le, cor­rec­tab­le data­base re­cords is re­fe­ren­cing. You might have de­du­ced it from the fact that there is no uni­que pri­ma­ry key in our sche­ma. There is sim­ply no way to di­rect­ly re­fe­rence a spe­ci­fic re­cord using fo­reign keys.

But even if it was possible, it would not work too well. Two diffe­rent re­cords sto­red in two diffe­rent tab­les may have diffe­rent li­fe­spans. Imagine the follo­wing tab­les:

Person IDFirst nameLast nameValid fromValid to
1AdamsAgatha2023-06-012023-07-01
1CarruthersAgatha2023-07-01-
Person IDPhone num­berValid fromValid to
1123 456 7892023-06-012023-07-15
1345 678 1232023-07-15-

Agatha Adams changed her last name on July 1st. But she also de­ci­ded to chan­ge her phone car­rier two weeks later. There is no way you can link her phone num­bers, with their own vali­dity periods, with her per­so­nal re­cord, with diffe­rent vali­dity periods. If you tried, you would in­evi­tab­ly intro­duce ar­ti­fi­cial boun­da­ries in the re­cord li­fe­cyc­les, and the scheme would fall apart as soon as a do­zen of diffe­rent tab­les would need to link to each other.

You can solve this prob­lem in two ways:

  1. Create really com­plex queries which try to pick and join re­cords from diffe­rent da­ta tab­les based on a spe­ci­fic point in time. This works for two or three tab­les, but may quickly get out of hand.
  2. Move the aggre­gation logic to the code level. This is why con­structs made up of mul­tiple entities are called ag­gre­ga­tes after all. You can re­trieve in­di­vi­dual re­cords or re­cord collec­tions from the data­base using simp­ler queries, and then re­cre­ate the aggre­gate in the appli­ca­tion using these re­cords.

There is more to up­dates than ver­sio­ning

Of course, the above gui­de does not touch all the to­pics re­la­ted to da­ta up­date and mo­di­fi­cation and re­cord ver­sio­ning. In many cases, it is not enough to on­ly keep track of changes made to the da­ta. It may as well be ne­ce­ssary to track in­for­ma­tion about the user who ini­tia­ted the chan­ge and the rea­son the chan­ge was in­tro­du­ced in the first pla­ce.

However, these to­pics exceed the scope of this article. Additio­nal­ly, the actual im­ple­men­tation of these me­cha­nisms will de­pend on the sys­tem-spe­ci­fic re­quire­ments. Feel free to pur­sue the idea with your own im­ple­men­tation, and put a com­ment be­low if you would like to see a sepa­rate pie­ce on this topic on my site.

Playground

You can ex­pe­ri­ment with the ideas pre­sen­ted in this gui­de using your own in­stan­ce of a data­base. However, if you do not have one, you can al­ways try to use one on­line. On SQL Fiddle you can cre­ate your own set of tab­les, fill them with da­ta, and then run queries to see how they work. The site supports mul­tiple SQL dia­lects, in­clu­ding MySQL one used in above examples. The sche­mas are also per­sis­tent, which makes it so much easier to get back to work a day later.

In or­der to de­fine the em­ploy­ees table, you can use the follo­wing pie­ce of code:

CREATE TABLE employee (
  PERSON_ID INTEGER NOT NULL,
  LAST_NAME VARCHAR(256),
  FIRST_NAME VARCHAR(256),
  PHONE_NO VARCHAR(100),
  VALID_FROM DATE NOT NULL,
  VALID_TO DATE NULL DEFAULT NULL,
  CORRECTED_ON DATE NULL DEFAULT NULL);

In or­der to pre-fill the table with ini­tial da­ta, just run the follo­wing sta­te­ments:

INSERT INTO employee VALUES (1, "Adams", "Agatha", "123 456 789", "2023-06-01", NULL, NULL);
INSERT INTO employee VALUES (2, "Smith", "John", "243 567 890", "2023-06-08", NULL, NULL);

And finally, to cre­ate the in­stances of ob­jects being the result of up­dating and amen­ding da­ta, try the follo­wing sta­te­ments:

UPDATE employee SET VALID_TO="2023-06-20" WHERE PERSON_ID=1 AND VALID_TO IS NULL;
INSERT INTO employee VALUES (1, "Adams", "Agatha", "987 654 3217", "2023-06-20", NULL, NULL);

UPDATE employee SET CORRECTED_ON="2023-06-01" WHERE PERSON_ID=2 AND CORRECTED_ON IS NULL;
INSERT INTO employee VALUES (2, "Smith", "John", "234 567 890", "2023-06-08", NULL, NULL);

Could we do it better?

There is certainly a group of people smi­ling when reading this article. They are the prac­ti­tio­ners of the Event Sourcing tech­ni­que, and they do not need to cre­ate such com­plex data­base sche­mas and queries to achieve com­ple­te au­di­ta­bi­li­ty. Event sour­cing in­he­rent­ly pro­vides such capa­bi­li­ties, as events ac­cu­mu­la­ted in the event log re­pre­sent every chan­ge and cor­rec­tion made in the past.

But not every appli­ca­tion needs event sour­cing. This tech­ni­que is hard to get right and needs skil­led hands to pro­per­ly im­ple­ment it. And yet, if sen­sible per­for­mance levels are to be achieved, some ag­gre­ga­tes may need re­la­tio­nal snap­shots sto­red in the data­base. And if these ag­gre­ga­tes need to be que­ried for past state, the snap­shots may need to follow the pattern pre­sen­ted above.

In many cases, a good im­ple­men­tation of do­main driven de­sign, with the data­base per­sis­ting ag­gre­ga­tes, entities and va­lue ob­jects di­rect­ly in a re­la­tio­nal form, is good enough. And if your do­main re­quires you to store in­for­ma­tion about all past changes to da­ta, I hope you find my gui­de help­ful.