On data archaeology and how to prevent a "Digital Dark Age"
Have you considered the fact that future generations might have little to no record of the 21st century? It sounds kind of dramatic, but the threat of not being able to access obsolete digital data is very real.
The problem was first raised in 1997 and warns of a massive loss of precious data that is currently stored in proprietary, obscure binary formats and is sometimes encrypted as well. As time passes, data produced with very old, unmaintained versions of vendor-specific software, with practically no documentation of the format details, will eventually be recoverable only through reverse engineering.
That is an extreme method of data recovery and it cannot on its own guarantee the completeness of the recovered data. The reason for that is the uncertainty regarding the reconstruction of the original context of the file format in which the data has been stored. It is more of an emergency recovery than a reliable data recovery method… and the more time passes, the higher the chances of losing the knowledge of how those formats are read.
Some big software vendors like Microsoft, Google and IBM have made attempts to tackle this problem preemptively since the concern was first raised. The UK National Archives, for example, teamed up with Microsoft in 2007 to convert 580TB of data stored in old Microsoft file formats. Google's vice president Vinton Cerf, who introduced the concept of a Digital Vellum, said: "The onus is on us to kickstart the movement to preserve present-day hardware, software, and content and ensure that they are still accessible 50 years from now, or risk becoming a forgotten century." IBM Research - Haifa Laboratory launched the Long Term Digital Preservation Project. They believe that our society is facing a Digital Dark Age: although our ability to store digital bits is increasing, our ability to store the data over time decreases.
This brings us to the topic of digital preservation. It is a formal effort to ensure that digital information remains accessible. It is the method of keeping digital material alive so that it remains usable as technological advances render the original hardware and software specifications obsolete. When digital preservation has not been applied, data archaeology comes into action with a set of methods for recovering information stored in old formats.
Data archaeology is closely related to what I do at work: decommissioning. That is the formal process of removing something from active status. In relation to application software it is also known as application retirement: the practice of shutting down redundant or obsolete business applications while retaining access to their historical data.
The decommissioning process very often comes after data archaeology. I will try to briefly go through the process.
Data is currently stored for preservation purposes only and is accessed rarely, if at all. It is hard to reach and it’s not intended to be used often. The reason for keeping it close at hand, but not making it widely available, is the difficulty posed by the software applications used to create it. The data is locked in a proprietary format with no universal way to export it into a standard archive which can be easily validated afterwards.
There are many issues with the data that we receive for decommissioning: improper or incomplete data backups; backups in which some or all of the data is encrypted; old, obsolete file formats which make the recovery and data migration more difficult; missing or incorrect metadata, which forces us to identify the file formats and data in the backup manually; improperly applied digital preservation strategies that result in data corruption. The list is long.
Given that what we usually decommission is historical data, it is often stored in obsolete and non-standard formats. In such cases the methods of data archaeology are what help. My usual routine includes: researching the quirks of “esoteric” formats and the different software vendors that produced them; reading old documents and lots of source code in different programming languages; understanding the file formats and the differences between the old and new versions of databases and other software; and searching for patterns to identify data we’ve already encountered. In this way, we extract the data out of those formats while making sure its value is preserved.
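To make the "searching for patterns" part a bit more concrete, here is a minimal sketch in Python of identifying a file by the first bytes of its content (its "magic number") instead of trusting its name or extension. This is not the tooling we actually use; the function names and the tiny signature table are my own assumptions for illustration, and a real registry of format signatures would be far larger.

```python
# Illustrative signature table (an assumption for this sketch, not a complete
# registry): the leading bytes of a few widely documented formats.
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP container (also used by OOXML and ODF)"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (legacy Microsoft Office)"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
]


def identify_format(header: bytes) -> str:
    """Return a format name for the given leading bytes, or "unknown"."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def identify_file(path: str, probe_size: int = 16) -> str:
    """Read only the first few bytes of a file and try to identify it."""
    with open(path, "rb") as fh:
        return identify_format(fh.read(probe_size))
```

A file that comes back as "unknown" is where the real archaeology starts: that is the point at which old documentation and source code have to be consulted.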
Then, the data goes into a system that is compliant with the latest preservation standards. The latter require that data is converted to a well-structured and well-described format, which ensures readability and provides the necessary context to make it as easy as possible to understand the data correctly.
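As a rough illustration of what "well-described" can mean in practice, the sketch below stores a small descriptive record next to the extracted content, including a checksum that lets anyone verify later that the bytes are unchanged. The field names and both functions are hypothetical; they are not taken from any preservation standard or from a specific archive system.

```python
import hashlib
import json


def build_preservation_record(payload: bytes, original_format: str, description: str) -> dict:
    """Describe a piece of extracted content so it can be understood and verified later."""
    return {
        "description": description,          # human-readable context
        "original_format": original_format,  # where the data came from
        "size_bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),  # fixity information
    }


def verify_record(payload: bytes, record: dict) -> bool:
    """Check that the content still matches the size and checksum in its record."""
    return (
        len(payload) == record["size_bytes"]
        and hashlib.sha256(payload).hexdigest() == record["sha256"]
    )


if __name__ == "__main__":
    content = b"invoice 1997-001; amount 120.00"
    record = build_preservation_record(content, "legacy database export", "Sample invoice row")
    print(json.dumps(record, indent=2))
    print(verify_record(content, record))
```

The record is plain JSON on purpose: a text-based, self-describing format is much more likely to stay readable than yet another binary container.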
Digital preservation is the last crucial step of the decommissioning process. It not only allows access to the data while keeping it intact through the years as software and hardware advance, but also makes it possible to update the data while preserving its original value.
Thus, data that must not end up locked, stored away or forgotten about is protected from being lost.
To sum up, loss of data isn’t something we see daily. Yet it happens for numerous reasons. Even with the advent of technologies like quantum computing and strong artificial intelligence, and the promises of unimaginable intellectual and processing power they bring, it's still possible that the record of much of what defines our lives will be lost forever.
The more considerate people are of how and what we want to preserve as information, the easier it will be for anybody to keep up with the times. If my work here at Documaster can help others appreciate the importance of data preservation, I am all for it.