Data management has always labored under the reputation of being too difficult a task to take on. Face it: there is a lot of data recorded on storage media in most firms. It mostly consists of files created by users or applications that made no effort to identify the contents of the file in an objectively intelligible way.
Some of this data may have importance or value, but much does not. So, just beginning the data management exercise -- or one of the subordinate data management tasks like developing an information security strategy, a data protection strategy or an archive strategy -- first requires the segregation of data into classes: what's important, what's required to be retained in accordance with assorted laws or regulations (and do you even know which regulations or laws apply to you?), what needs to be retained and for how long, and so on.
Sorting through the storage "junk drawer" is considered a laborious task that absolutely no one wants to be assigned. And, assuming you do manage to sort your existing data, it is never enough. There is another wave of data coming behind the one that created the mess you already have. Talk about the Myth of Sisyphus.
What? You are still reading. Are you nuts?
Of course, everyone is hoping that data management will get easier, that wizards of automation will define tools to help corral and segregate all the bits.
Some offer a rip and replace strategy: rip out your existing file system and replace it with object storage. With object storage, all of your data is wrapped into a database construct that is rich with metadata. Sounds like just the thing, but it is a strategy that is easiest to deploy in a "greenfield" situation -- not one that is readily deployed after years of amassing undifferentiated data.
Another strategy is to deduplicate everything. That is, use software or hardware data reduction to squeeze more anonymous bits into a fixed amount of storage space. This may fix the capacity issue associated with the data explosion...but only temporarily.
Another strategy is to find all files that haven't been accessed in 30, 60 or 90 days, then just export those files into a cheap storage repository somewhere. If any of the data is ever needed again -- say, for legal discovery -- just provide a copy of this junk drawer, whether on premises or in a cloud, and let someone else sort through it all.
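The "export the stale stuff" strategy described above is simple enough to sketch. Here is a minimal illustration in Python -- the 90-day threshold, directory names and flat archive layout are assumptions for the example, and note that it relies on access times being recorded (some filesystems mount with `noatime`):

```python
import os
import shutil
import time

# Assumed policy threshold: 90 days without access marks a file as stale
STALE_AFTER = 90 * 24 * 3600  # seconds

def sweep_stale_files(source_dir, archive_dir):
    """Move files not accessed within STALE_AFTER into archive_dir.

    Returns the list of destination paths for the moved files.
    """
    now = time.time()
    moved = []
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            path = os.path.join(root, name)
            if now - os.stat(path).st_atime > STALE_AFTER:
                os.makedirs(archive_dir, exist_ok=True)
                dest = os.path.join(archive_dir, name)
                shutil.move(path, dest)
                moved.append(dest)
    return moved
```

A real sweep would preserve directory structure and handle name collisions in the archive; this flattened version just shows the last-accessed test that drives the whole strategy -- and illustrates why the result is a junk drawer, since nothing about the file's meaning is captured on the way out.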
Bottom line: just getting data into a manageable state is a pain. Needed are tools that can apply policies to data automatically, based on metadata. At a minimum, we should have automated tools to identify duplicates and dreck so they can be deleted, and other tools that can place the remaining data into a low-cost archive for later re-reference. This isn't perfect, but it is possible with what we have today.
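The duplicate-identification piece of that minimum toolset is well understood: hash file contents and group files that share a digest. A minimal sketch, with the directory layout and chunk size chosen for illustration:

```python
import hashlib
import os

def find_duplicates(top_dir):
    """Group files under top_dir by the SHA-256 of their contents.

    Returns a dict mapping each digest that occurs more than once to the
    list of paths sharing it -- candidates for deletion or consolidation.
    """
    by_digest = {}
    for root, _dirs, files in os.walk(top_dir):
        for name in files:
            path = os.path.join(root, name)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                # Read in 1 MB chunks so large files don't exhaust memory
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            by_digest.setdefault(h.hexdigest(), []).append(path)
    return {d: paths for d, paths in by_digest.items() if len(paths) > 1}
```

Production deduplication tools add refinements (comparing sizes first, working at the block rather than file level), but content hashing is the core of the idea.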
Going forward, we need to set up a strategy for marking files in a more intelligent way. That may involve adding a step to the workflow in which the file creator enters keywords and tags on files when saving them -- a step that can't be bypassed by the user! Virtually every productivity app has the capability for the user to enter granular descriptions of files, and some actually save this data about the data to a metadata construct appropriate for the file system or object model used to format the data itself.
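When the file system or object model offers no native home for that "data about the data," one crude but portable convention is a metadata "sidecar" file written alongside the document at save time. The sketch below is purely illustrative -- the `.meta.json` naming and the fields stored are assumptions, not any standard:

```python
import json
import os

def save_tags(path, keywords, author):
    """Write user-supplied keywords to a JSON sidecar next to the file.

    Illustrative convention only: object stores and some file systems can
    hold this metadata natively, but a sidecar works anywhere.
    """
    sidecar = path + ".meta.json"
    with open(sidecar, "w") as f:
        json.dump({"keywords": keywords, "author": author}, f)
    return sidecar

def load_tags(path):
    """Read back the metadata recorded for a file at save time."""
    with open(path + ".meta.json") as f:
        return json.load(f)
```

The point is less the mechanism than the discipline: whatever the container, the keywords have to be captured at creation time, while the person who knows what the file means is still at the keyboard.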
If that seems too "brute force," another option is to mark the files transparently as they are saved. Link file classification to the identity of the user who created the file, based on a user ID or login. If the user works in accounting, treat all of his or her output as accounting data and apply a policy appropriate to accounting data. That can be done by referencing an access control system like Active Directory to identify the department-qua-subnetwork in which the user works.
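The classify-by-owner idea reduces to a lookup followed by a policy table. In the sketch below, the departments, policy fields and the in-memory user directory are all hypothetical stand-ins -- a production version would resolve the department with an LDAP query against Active Directory rather than a dict:

```python
# Hypothetical department-to-policy table for illustration
DEPARTMENT_POLICY = {
    "accounting": {"retention_years": 7, "encrypt": True,  "tier": "archive"},
    "marketing":  {"retention_years": 1, "encrypt": False, "tier": "capacity"},
}

# Fallback for users whose department can't be determined
DEFAULT_POLICY = {"retention_years": 3, "encrypt": False, "tier": "capacity"}

def classify_by_owner(user_id, user_directory):
    """Look up the file owner's department and return (department, policy).

    user_directory stands in for an Active Directory lookup here.
    """
    dept = user_directory.get(user_id, "unknown")
    return dept, DEPARTMENT_POLICY.get(dept, DEFAULT_POLICY)
```

The appeal of this approach is that it costs the user nothing; the weakness, as noted below, is that it classifies by proxy rather than by content.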
Another approach might be to tag the data based on the workstation used to create the file. Microsoft opened up its File Classification Infrastructure (FCI) a few years ago. FCI goes beyond the classic attributes you see when you right-click a file name (Hidden, Read-only, Archive and the like): it lets administrators define custom classification properties and rules that travel with files. With FCI opened up for modification, each PC in the shop can be customized with additional properties (like ACCOUNTING) that will be stored with data created on that workstation.
Whether you mark the file by user role or by workstation/department, it isn't as accurate as manually entering granular metadata for every file that is created -- as you would by, say, deploying an object storage solution and migrating files into it while editing the metadata of each file by hand. You will get a lot of "false positives," and these will reduce the efficiency of your storage or your archive or whatever.
Unfortunately, the tools for data management are difficult to get information on. As reported in another blog post, doing an internet search for data management solutions yields a bunch of stuff that really has nothing to do with the metadata-based application of storage policy to files and objects. Many of the tools are bridges to cloud services, or they are backup software tools whose vendors are trying to teach them some new tricks, like archive. Others are just a wholesale effort by the vendor to grab you by your data, figuring that your hearts and minds will follow.
We believe that cognitive data management is the future. Take tools for storage resource management and monitoring, for storage service management and monitoring, and for global namespace creation and monitoring, then integrate the information contained in all three (all of which is being updated at all times) so that the right data is stored on the right storage and receives the right services (privacy, protection and preservation) based on a policy that is created by business and technology users who are in a position to know what the data is and how it needs to be handled.
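At its heart, the "right data on the right storage with the right services" decision is a matching problem between a policy and the pools the resource-monitoring side knows about. A toy sketch -- pool names, the tier/encryption fields and the first-fit selection rule are all assumptions made for illustration:

```python
def place_data(file_meta, policy, storage_pools):
    """Return the name of the first pool that matches the policy's tier,
    meets its encryption requirement, and has room for the file.

    storage_pools stands in for the live inventory a storage resource
    management tool would maintain; policy comes from the business side.
    """
    for pool in storage_pools:
        if (pool["tier"] == policy["tier"]
                and pool["free_gb"] >= file_meta["size_gb"]
                and (pool["encrypted"] or not policy["encrypt"])):
            return pool["name"]
    return None  # no compliant pool: surface this to an operator
```

A real cognitive data management engine would weigh far more signals (service levels, cost, namespace placement) and re-evaluate continuously as the monitoring data changes, but the shape of the decision is the same.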
Such cognitive data management tools are only now beginning to appear in the market. Watch this space for the latest information on what the developers are coming up with to simplify data management.