Posts from category "Cognitive Data Management Blog"

Looking for Data Management Tools that Work: Watch this Space

Data management has always labored under the impression that it was just too difficult a task to take on.  Face it: there is a lot of data recorded on storage media in most firms.  It mostly consists of files created by users or applications that wasted no effort identifying the contents of the file in an objectively intelligible way. 

Some of this data may have importance or value; but, much does not. So, just beginning the data management exercise -- or one of the subordinate data management tasks like developing an information security strategy or a data protection strategy or an archive strategy -- first requires the segregation of data into classes:  what's important, what's required to be retained in accordance with assorted laws or regulations (and do you even know which regs or laws are applicable to you?), what needs to be retained and for how long, etc. 

Sorting through the storage "junk drawer" is considered a laborious task that absolutely no one wants to be assigned.  And, assuming you do manage to sort your existing data, it is never enough.  There is another wave of data coming behind the one that created the mess you already have.  Talk about the Myth of Sisyphus.

What?  You are still reading.  Are you nuts?

Of course, everyone is hoping that data management will get easier, that wizards of automation will define tools to help corral and segregate all the bits.

Some offer a rip and replace strategy:  rip out your existing file system and replace it with object storage.  With object storage, all of your data is wrapped into a database construct that is rich with metadata.  Sounds like just the thing, but it is a strategy that is easiest to deploy in a "greenfield" situation -- not one that is readily deployed after years of amassing undifferentiated data.

Another strategy is to deduplicate everything.  That is, use software or hardware data reduction to squeeze more anonymous bits into a fixed amount of storage space.  This may fix the capacity issue associated with the data explosion...but only temporarily.

Another strategy is to find all files that haven't been accessed in 30, 60 or 90 days, then just export those files into a cheap storage repository somewhere.  If any of the data is ever needed again -- say, for legal discovery -- just provide a copy of this junk drawer, whether on premises or in a cloud, and let someone else sort through it all.
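
The selection step of this approach is easy to automate, which is part of its appeal.  As a minimal sketch, assuming the file system records last-access times reliably (many volumes are mounted with access-time updates reduced or disabled), the following walks a directory tree and lists files not accessed within a given number of days.  The function name and parameters are hypothetical.

```python
import os
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def find_stale_files(root, days, now=None):
    """Return paths under `root` whose last-access time is older than `days` days."""
    now = time.time() if now is None else now
    cutoff = now - days * SECONDS_PER_DAY
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                atime = path.stat().st_atime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if atime < cutoff:
                stale.append(path)
    return sorted(stale)
```

Calling find_stale_files("/some/share", 90) would return the candidates for export.  Note that the sketch says nothing about what the files contain, which is exactly the weakness of the strategy.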

Bottom line:  just getting data into a manageable state is a pain.  Needed are tools that can apply policies to data automatically, based on metadata.  At a minimum, we should have automated tools to identify duplicates and dreck, so it can be deleted, and other tools that can place the remaining data into a low cost archive for later re-reference.  This isn't perfect, but it is possible with what we have today.
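
To make the duplicate identification idea concrete, here is a minimal sketch, assuming that "duplicate" means byte-for-byte identical content.  It groups files by size first (a cheap filter) and then by SHA-256 digest.  The function names are hypothetical, and the sketch only reports duplicates; deciding which copy to delete remains a policy question.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return groups (lists of paths) of files under `root` with identical content."""
    by_size = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        for path in paths:
            by_hash[file_digest(path)].append(path)

    return [sorted(group) for group in by_hash.values() if len(group) > 1]
```

Each returned group holds two or more paths with the same content, so all but one member of each group is a candidate for deletion.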

Going forward, we need to set up a strategy for marking files in a more intelligent way.  That may involve adding a step to the workflow in which the file creator adds keywords and tags to files when saving them -- a step that can't be overridden by the user!  Virtually every productivity app has the capability for the user to enter granular descriptions of files, and some actually save this data about the data to a metadata construct appropriate for the file system or object model used to format the data itself.

If that seems too "brute force," another option is to mark the files transparently as they are saved.  Link file classification to the identity of the user who created the file, based on a user ID or login.  If the user works in accounting, treat all of his or her output as accounting data and apply a policy appropriate to accounting data.  That can be done by referencing a directory service like Active Directory to identify the department-qua-subnetwork in which the user works.
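
A minimal sketch of this role-based marking follows.  It assumes the directory lookup has already been reduced to a simple user-to-department table; a real deployment would query the directory service instead.  The user names, department names and policy values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    retention_years: int
    encrypt: bool


# Hypothetical stand-in for a directory lookup of each user's department.
USER_DEPARTMENT = {"jdoe": "accounting", "asmith": "engineering"}

# Hypothetical per-department policies; the values are illustrative only.
DEPARTMENT_POLICY = {
    "accounting": Policy(retention_years=7, encrypt=True),
    "engineering": Policy(retention_years=3, encrypt=False),
}
DEFAULT_POLICY = Policy(retention_years=1, encrypt=False)


def classify(owner):
    """Return (department, policy) for a file based only on who created it."""
    department = USER_DEPARTMENT.get(owner, "unclassified")
    return department, DEPARTMENT_POLICY.get(department, DEFAULT_POLICY)
```

Here classify("jdoe") yields the accounting policy, while an unknown owner falls back to the default.  Every file a user creates gets the same label regardless of content, which is why this approach is coarse.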

Another approach might be to tag the data based on the workstation used to create the file.  Microsoft opened up its File Classification Infrastructure a few years ago.  That's the thing that shows attributes for files when you right click the file name:  HIDDEN, SECURE, ARCHIVE, etc.  With FCI opened up for user modification, each PC in the shop can be customized with additional attributes (like ACCOUNTING) that will be stored with data created on that workstation. 

Whether you mark the file by user role or by workstation/department, it isn't as effective as manually entering granular metadata for every file that is created.  So it won't be as effective as, say, deploying an object storage solution and manually migrating files into that object storage system while editing the metadata of each file.  You will get a lot of "false positives" and this will reduce the efficiency of your storage or your archive or whatever.



Unfortunately, information on data management tools is difficult to find.  As reported in another blog post, doing an internet search for data management solutions yields a bunch of stuff that really has nothing to do with the metadata-based application of storage policy to files and objects.  Many of the tools are bridges to cloud services, or they are backup software tools whose vendors are trying to teach them some new tricks, like archive.  Others are just a wholesale effort of the vendor to grab you by your data, figuring that your hearts and minds will follow.

We believe that cognitive data management is the future.  Take tools for storage resource management and monitoring, for storage service management and monitoring, and for global namespace creation and monitoring.  Then integrate the information contained in all three (all of which is being updated at all times) so that the right data is stored on the right storage and receives the right services (privacy, protection and preservation).  That placement should be based on a policy created by business and technology users who are in a position to know what the data is and how it needs to be handled.

Such cognitive data management tools are only now beginning to appear in the market.  Watch this space for the latest information on what the developers are coming up with to simplify data management.

Surveying the Data Management Vendor Market: Methodology

The data management market today comprises many products and technologies, but comparatively few that include all of the components and constructs enumerated above.  To demonstrate the differences, we surveyed the offerings of vendors that frequently appear in trade press accounts, industry analyst reports and web searches.  Our list originally included the following companies:
  • Avere Systems*
  • Axaem
  • CTERA*
  • Clarity NOW Data Frameworks
  • Cloudian HyperStore
  • Cohesity*
  • Egnyte
  • ElastiFile*
  • Gray Meta Platform
  • IBM*
  • Komprise
  • Nasuni*
  • Panzura*
  • Primary Data*
  • QStar Technologies*
  • Qubix
  • Seven10
  • SGL
  • ShinyDocs
  • StarFish Global
  • StorageDNA*
  • STRONGBOX Data Solutions*
  • SwiftStack Object Storage*
  • Talon*
  • Tarmin*
  • Varonis
  • Versity Storage Manager

Only a subset of these firms (denoted with asterisks) responded to our requests for interviews, which we submitted by email to the point of contact identified on their websites or in press releases.  After scheduling interviews, we invited respondents to provide us with their “standard analyst or customer product pitch” – usually delivered as a presentation across a web-based platform – and we followed up with questions to enable comparisons of the products with each other.  We wrote up our notes from each interview and submitted them to the vendor to ensure that we had not misconstrued or misunderstood their products.  When comments were received back from the respondents, we updated the interview notes to ensure their accuracy.  Following are those discussions.

What is Cognitive Data Management?

Ideally, a data management solution will provide a means to monitor data itself – the status of data as reflected in its metadata – since this is how data is instrumented for management in the first place.  Metadata can provide insights into data ownership at the application, user, server, and business process level.  It also provides information about data access and update frequency and physical location.



A real data management solution will offer a robust mechanism for consolidating and indexing this file metadata into a unified or global namespace construct.  This provides uniform access to file listings to all authorized users (machine and human) and a location where policies for managing data over time can be readily applied.
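
As a rough illustration of what such an index might look like, the sketch below consolidates metadata records from several storage systems under one logical path space.  The record fields and class names are assumptions made for this example and are not drawn from any vendor's product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    logical_path: str   # path as presented to users and applications
    store: str          # physical storage system currently holding the file
    owner: str
    size_bytes: int
    last_access: float  # epoch seconds


class GlobalNamespace:
    """A single index over file metadata gathered from many storage systems."""

    def __init__(self):
        self._index = {}

    def ingest(self, records):
        """Add or refresh records, keyed by logical path."""
        for rec in records:
            self._index[rec.logical_path] = rec

    def lookup(self, logical_path):
        """Return the record for a logical path, or None if it is not indexed."""
        return self._index.get(logical_path)

    def select(self, predicate):
        """Return records matching a predicate, the hook where a policy attaches."""
        matches = (r for r in self._index.values() if predicate(r))
        return sorted(matches, key=lambda r: r.logical_path)
```

The select method is the point of the exercise: once metadata from every store sits in one index, a policy can be expressed as a predicate over records rather than as a script per storage device.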

That suggests a second function of a comprehensive or real data management solution.  It must provide a mechanism for creating management policies and for assigning those policies to specific data to manage it through its useful life.  

A data management policy may offer simplistic directions.  For example, it may specify that when accesses to the data fall to zero for thirty days, the data should be migrated off of expensive high performance storage to a less expensive lower performance storage target.  However, data management policies can also define more complex interrelationships between data, or they may define specific and granular service changes to data that are to be applied at different times in the data lifecycle.  Initially, for example, data may require continuous data protection in the form of a snapshot every few seconds or minutes in order to capture rapidly accruing changes to the data.  Over time, however, as update frequency slows, the protective services assigned to the data may also need to change – from continuous data protection snapshots to nightly backups, for example.  Such granular service changes may also be defined in a policy.
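
The two examples in this paragraph can be sketched as a small policy function.  This is an illustration under stated assumptions: the thirty-day idle window comes from the text, while the update-rate threshold, the tier names and the service names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataStatus:
    days_since_last_access: int
    updates_per_day: float


def evaluate(status, idle_days=30, cdp_update_threshold=10.0):
    """Return the storage tier and protection service a policy would assign.

    idle_days: move data to cheaper storage after this many days without access.
    cdp_update_threshold: hypothetical update rate at or above which
        continuous snapshots are kept instead of nightly backups.
    """
    if status.days_since_last_access >= idle_days:
        tier = "capacity"      # less expensive, lower performance
    else:
        tier = "performance"   # expensive, high performance

    if status.updates_per_day >= cdp_update_threshold:
        protection = "continuous_snapshots"
    else:
        protection = "nightly_backup"

    return {"tier": tier, "protection": protection}
```

Running the function periodically against current metadata is what turns a static rule into lifecycle management: the same data receives different answers as its access and update patterns change.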

The policy management framework provides a means to define and use the information from a global namespace to meet the changing storage resource requirements and storage service requirements (protection, preservation and privacy are defined as discrete services) of the data itself.  The work of provisioning storage resources and services to data, however, anticipates two additional components of a data management solution.

In addition to a policy management framework and global namespace, a true data management solution requires a storage resource management component and a storage services component.  The storage resource management component inventories and tracks the status of the storage that may be used to provide hosting for data.  This component monitors the responsiveness of the storage resource to access requests as well as its current capacity usage.  It also tracks the performance of various paths to the storage component via networks, interconnects, or fabrics.  
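
A minimal sketch of the resource manager's selection role follows, assuming it already holds an inventory with capacity and latency figures for each target.  The field names and the ranking rule (most free space first) are hypothetical choices for this example; path performance is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageTarget:
    name: str
    capacity_bytes: int
    used_bytes: int
    avg_latency_ms: float  # observed responsiveness to access requests

    @property
    def free_bytes(self):
        return self.capacity_bytes - self.used_bytes


def eligible_targets(targets, needed_bytes, max_latency_ms):
    """Return targets with enough free space and acceptable latency, most free first."""
    fits = [
        t for t in targets
        if t.free_bytes >= needed_bytes and t.avg_latency_ms <= max_latency_ms
    ]
    return sorted(fits, key=lambda t: t.free_bytes, reverse=True)
```

A policy engine would call eligible_targets with the requirements a policy implies for a given piece of data, then place the data on the first target returned.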

The storage services management component performs roughly the same work as the storage resource manager, but with respect to storage services for protection, preservation and privacy.  This management engine identifies all service providers, whether they are software providers operated on dedicated storage controllers, or as part of a software-defined storage stack operated on a server, or as stand-alone third party software products.  The service manager identifies the load on each provider to ensure that no one provider is overloaded with too many service requests.
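
The load-levelling role described here can be sketched as follows, assuming each provider reports its active and maximum request counts.  The class, the field names and the least-loaded selection rule are hypothetical; the text does not say how a real service manager chooses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceProvider:
    name: str
    services: frozenset    # e.g. {"protection", "privacy"}
    active_requests: int
    max_requests: int

    @property
    def load(self):
        """Fraction of this provider's request capacity currently in use."""
        return self.active_requests / self.max_requests


def pick_provider(providers, service):
    """Return the least-loaded provider offering `service`, or None if all are full."""
    candidates = [
        p for p in providers
        if service in p.services and p.active_requests < p.max_requests
    ]
    return min(candidates, key=lambda p: p.load, default=None)
```

Providers that are already at capacity are excluded outright, so a None result signals that the request should be queued or that another provider is needed.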

Together with the policy management framework and global namespace, storage resource and storage service managers provide all of the information required by decision-makers to select the appropriate resources and services to provision to the appropriate data at the appropriate time in fulfillment of policy requirements.  That is an intelligent data management service – with a human decision-maker providing the “intelligence” to apply the policy and provision resources and services to data.

However, given the amount of data in even a small-to-medium-sized business computing environment, human decision-makers may be overwhelmed by the sheer volume of data management work that is required.  For this reason, cognitive computing has found its way into the ideal data management solution.  

A cognitive computing engine – whether in the form of an algorithm, a Boolean logic tree, or an artificial intelligence construct – supplements manual methods of data management and makes possible the efficient handling of extremely large and diverse data management workloads.  This cognitive engine is the centerpiece of “cognitive data management” and is rapidly becoming the sine qua non of contemporary data management technology and a key differentiator between data management solutions in the market.

Starting from the Beginning

What exactly is data management?

Data management means different things to different people.  To most, it is a term used to describe the deliberate movement of data between different data storage components during the useful life of the data itself.  The rationale for such movement is often helpful in differentiating data management products from one another.

For example, data may be moved to decrease storage costs.  Different storage devices may be grouped together by performance and cost characteristics to define “tiers” of storage infrastructure. 

Data that is accessed and updated frequently may be best hosted in the highest performance (most costly) tiers, while data that is older and less frequently accessed or updated may be more economically hosted on less performant and less expensive tiers.  A product that tracks data access and modification frequency and that moves less active data to slower tiers automatically may be termed a data management solution, though such products are more appropriately termed hierarchical storage management or HSM products.

Similarly, data may be migrated between storage devices to level or optimize the load placed on specific devices or interconnecting links.  This may be done to improve overall access performance by introducing target parallelism or simply to scale overall capacity more efficiently.  It may also provide a means to enable the decommissioning of certain storage products when they have reached end of life by providing a way to migrate or copy their data to alternative or newer storage with minimal operator intervention.  Again, this may be termed data management, but it is actually infrastructure management or scale out architecture.

Moving or copying data between storage platforms may also be performed in order to preserve certain data assets that, while they are rarely accessed or updated, must be retained for legal, regulatory or business reasons.  The target “archival” storage may comprise very low cost, very high capacity media such as tape.  Technically speaking, this is an archive rather than a data management product. 

Archive may be part of data management, but it is not necessarily a data management solution unto itself.

Data management is a policy-driven exercise.  True data management involves the active placement of data across infrastructure based on business policy coupled to the business context and value of the data itself.  While the expense of the storage target and the frequency of access and modification of the data can – and should – also serve as variables in the determination of policies for how the data should be hosted, what services the data is provided to ensure its protection, preservation and privacy, and when it should be moved or discarded, real data management considers data value, not just storage capacity and cost.  Its goal is more than improving capacity allocation efficiency; data management strives to improve capacity utilization efficiency.  Data management is business centric, not storage centric.

Many of the products obtained in web searches as “data management solutions” do not deliver business value centric management at all.  Some are simply HSM, migration, or archival products.  Others have an underlying agenda: to move data out of one architectural or topological model into another.  For example, several firms are terming as data management solutions products that are intended to bridge on-premises hosted data into a cloud service model.  Others are, under the covers, seeking to move file system data into object storage system models, or hard disk hosted data into solid state storage products leveraging non-volatile memory chips.

While potentially useful migratory tools, these are not necessarily what a consumer may be seeking who is trying to place data under greater business control so it can be shared more efficiently, used in analytical research more readily, or governed in accordance with the latest legal or regulatory mandate.

Following on this thread, we will look at the information gleaned from the web about vendors whose products turn up in a web search engine query on the term "data management."  If you are a vendor or user of any of these products, please expand our research with additional information.

Welcome to the Cognitive Data Management Blog at DMI

Welcome to our blog on cognitive data management at DMI.  This is intended to become a forum for the community of data managers who are interested in simplifying, streamlining and automating the data management workload through the application of cognitive computing technology.

"Cognitive" sounds so trendy.  What "cognitive" means varies depending on whom you ask.

In some cases, cognitive computing is metaphorical.  It refers to a fairly common software engine that simply executes predefined instructions written in any number of scripting or programming languages.

In other cases, cognitive computing refers to the application of algorithms to data in order to discern and respond to recognizable patterns.

In still other cases, cognitive refers to machine learning:  a set of sophisticated programs that evaluate collected data, compare them to data management policies (criteria, standards, etc.) and determine what if any actions to take.

This blog provides a location to learn more about the theory of CDM and the capabilities of the current generation of vendor products purporting to provide cognitive data management services.  Ultimately, we agree that the volume of data that is amassing in most organizations already exceeds the capability of human administrators to manage; automated tools are needed to support the effort.

Let's learn more about CDM and share our experiences with data management generally using this forum.