Data Version Control


Over the past several years, I have been making mental notes on how people use and interact with data. Having also been exposed to the Git version control system during that time, I have come to the conclusion that we have yet to develop a version control system for data that properly reflects how people interact with their data.

Generally speaking, I have noticed the following things about human nature, especially with respect to interacting with data:

  • People tend to work independently, then come back to share and compare their work, rather than taking turns working on a single, canonical version.
  • Users are frequently left wondering “how has this file changed from point A to point B?” with no effective way to answer short of going line by line through different physical versions of the file.
  • People generally use the same software to edit data, but not always.
  • People do not want to deal with connectivity issues when working with data, or risk a dropped connection. This feels like the biggest fallacy of them all - that everyone will someday have infinite connectivity to the internet, and therefore cloud/web-based apps can satisfy all needs. While that may be an eventuality, it is a lot further off than I think most developers appreciate.

These are just a few of the behaviors that lead me to believe that the currently accepted data VCSs, such as SharePoint, are at best insufficient tools, and are more likely creating drag on the collaborative nature of teams.

This is where the Git VCS comes in. Git, while intended for the programming community, provides solutions to the problems above.

  • Every user has a local working copy of the information, meaning:
    • They can edit at will, without having to worry about how their changes will affect someone else.
    • They do not have to deal with connectivity issues until they decide to push their changes back.
  • The Git branching model enforces the following:
    • The onus is on the user to resolve conflicts between their version and the current canonical version.
    • A user doesn’t have to wait until person A “checks back in” a file before applying their updates.
    • It creates a log of all states of a file or set of files (sketched below).
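
To make that last point concrete, here is a toy sketch - a heavily simplified echo of Git’s content-addressed storage, not how Git actually works internally - of what a log of all states of a file looks like. All names here are hypothetical:

```python
import hashlib
import time

# Toy content-addressed log of file states: every snapshot is stored
# under the hash of its content, so identical states are stored once
# and the full history of the file is always recoverable.
class SnapshotLog:
    def __init__(self):
        self.objects = {}   # content hash -> file bytes
        self.history = []   # ordered (timestamp, hash) entries

    def commit(self, data: bytes) -> str:
        digest = hashlib.sha1(data).hexdigest()
        self.objects[digest] = data
        self.history.append((time.time(), digest))
        return digest

    def log(self) -> list[str]:
        return [digest for _, digest in self.history]

log = SnapshotLog()
log.commit(b"name,score\nalice,3\n")
log.commit(b"name,score\nalice,4\n")
print(log.log())  # every recorded state of the file, oldest first
```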

Why should developers have all the good tools for dealing with computers?

So how would the ideal data VCS behave? Why not just use Git - doesn’t it serve this purpose? The answer, IMO, is yes and no. Git does provide a backbone on which you can implement the VCS aspects of this workflow. What it lacks, however, is any real ability to view diffs between different versions of data files (because many of them are binary, or their structure otherwise doesn’t lend itself to code diff tools), or to merge data files back together in a logical way.

The latter is a problem I am not entirely sure we can ever fully solve, as file formats by their nature are constantly changing, and the risk of writing a broken file is fairly high. It would also require the independent development of an interface to check the merge and make modifications where necessary. I am not in the business of competing with tools that are likely better for editing the files they create (e.g. Excel for xlsx, ArcGIS for shapefiles, etc.). Creating a solid diff foundation, however, feels within reach, even if it would be complicated to implement. It would also serve the primary need for the adoption of a Git-type VCS for data: the ability to see differences between file versions.
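
To ground what that diff foundation might mean, here is a minimal sketch, assuming both inputs have already been reduced to the same primary type (tabular data, as lists of rows). Everything here is hypothetical; the point is that the diff operates on rows and cells rather than on lines of text:

```python
# Minimal data-aware diff over two tables (lists of rows). Unlike a
# line-based code diff, it can report which cell of which row changed.
def diff_tables(old: list[list], new: list[list]) -> list[str]:
    changes = []
    for i, (old_row, new_row) in enumerate(zip(old, new)):
        for j, (a, b) in enumerate(zip(old_row, new_row)):
            if a != b:
                changes.append(f"row {i}, col {j}: {a!r} -> {b!r}")
    for i in range(len(new), len(old)):
        changes.append(f"row {i} removed: {old[i]!r}")
    for i in range(len(old), len(new)):
        changes.append(f"row {i} added: {new[i]!r}")
    return changes

old = [["name", "score"], ["alice", 3], ["bob", 5]]
new = [["name", "score"], ["alice", 4], ["bob", 5], ["carol", 2]]
print("\n".join(diff_tables(old, new)))
# row 1, col 1: 3 -> 4
# row 3 added: ['carol', 2]
```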

So, I am going to start planning out what this would look like. How would a universal diffing tool behave, focused on data types (documents, tables, audio, video, geospatial, etc.)? Generally, here is how I am thinking about it:

  • I am not looking to create an entirely new UI that needs to be supported. It should utilize existing tools as much as possible - tools that are available on basically every system.
  • An artifact should be created during the diff. I think this is important for multiple reasons, but mostly so that the difference between states can be cached, or passed to another individual so that it can be viewed.
  • It is important that components that serve the same need can be compared, even if the file types are not exactly the same. For example:
    • Comparing an xlsx file with a csv
    • Comparing a pdf with a docx
    • Comparing a jpg with a png
  • There are also instances where files can contain subtypes of other primary types. For example, a docx file can be composed of written paragraphs, images, and tables.
  • However, there is no logical use case for comparing types that do not match (e.g. comparing a jpg to a csv), so there is no need to maintain comparable structures for different primary data types.
  • But this does mean that components of files will have to be translated into their primary types before comparisons can be made. For example: say we have a docx file that contains images, tables, and written prose, and we want to compare that file to an Excel file that contains 5 different tables. How would this work?
    • First, both files have matches for certain primary data types - they both have tabular data. This is what we should be comparing.
    • The docx table would need to be translated to its primary data type (tabular), and the same goes for each of the tables within the Excel file.
    • Then, ideally, the tables that best match each other would be selected for comparison (see the pairing sketch after this list), although other pairings should remain possible.
  • Maintaining these translators from each secondary data type (e.g. tabular data in an xlsx) to its primary data type (tabular) is, at best, going to be incredibly difficult, and will require a supportive community.
  • Since it is unlikely that the community will always be on top of the latest changes to a data type (e.g. Microsoft adds a new feature to Excel, embedding those changes into the xlsx file type), the system will need to intelligently handle and display things that were “there, but not parsable for comparison”, and preferably give a sense of where those changes exist within the file.
    • For example, let’s say we haven’t built the primary type for comments in a document yet, so the diff file will not know how to clearly display these changes. The user, however, should not be left in the lurch to guess that there might have been other changes they were not aware of, simply because the translator does not yet support them. Therefore, there will need to be a way to list (preferably modularized, and with a sense of location) the presence of a change that is not parsable, and some indication of what it contains (e.g. Unknown type: Comment <comment>20146203 - 'I do not agree with this'), to the extent that it can be known.
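
As referenced in the docx-vs-Excel example above, here is one hedged sketch of how the “best match” pairing between extracted tables could work. The similarity measure (overlap of header rows) is purely an assumption for illustration; a real implementation would need something far more robust:

```python
# Hypothetical pairing of tables extracted from two files, done before
# any cell-level diffing. "Similarity" here is just Jaccard overlap of
# the header rows, which is an assumption, not a settled design.
def similarity(t1: list[list], t2: list[list]) -> float:
    h1, h2 = set(t1[0]), set(t2[0])  # treat the first row as the header
    return len(h1 & h2) / len(h1 | h2) if (h1 | h2) else 0.0

def pair_tables(left: list, right: list) -> list[tuple[int, int]]:
    scored = [(similarity(a, b), i, j)
              for i, a in enumerate(left)
              for j, b in enumerate(right)]
    scored.sort(reverse=True)  # greedily take the best candidates first
    used_i, used_j, pairs = set(), set(), []
    for score, i, j in scored:
        if score > 0 and i not in used_i and j not in used_j:
            pairs.append((i, j))
            used_i.add(i)
            used_j.add(j)
    return pairs

docx_tables = [[["name", "score"], ["alice", 3]]]
xlsx_tables = [[["qty", "price"]], [["name", "score"], ["alice", 4]]]
print(pair_tables(docx_tables, xlsx_tables))  # [(0, 1)]
```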

There is a lot here, and one of my biggest concerns is that these operations are going to be incredibly expensive to execute. For example, for a document containing tables, images, and text, a conversion for each of the subtypes would have to be done, which will likely blow up the size of the diff file.
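
One mitigation, building on the earlier point that the diff artifact can be cached: key every expensive conversion by a hash of the component’s raw bytes, so components that are identical between two versions are never converted (or diffed) twice. A hypothetical sketch:

```python
import hashlib

# Hypothetical conversion cache: unchanged components hash to the same
# key, so the expensive translation step runs only for new content.
_conversion_cache: dict[str, object] = {}

def convert_cached(raw: bytes, convert) -> object:
    key = hashlib.sha256(raw).hexdigest()
    if key not in _conversion_cache:
        _conversion_cache[key] = convert(raw)  # the expensive step
    return _conversion_cache[key]

# e.g. table = convert_cached(docx_table_bytes, parse_docx_table),
# where parse_docx_table is whatever translator handles that subtype.
```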

Example of how this might look

So, let’s use an example of how these types might work. Assume you have a document that looks something like this:

Title: An example document

There is a lot to be said about an example document. For instance, these documents might have tables:

[table]

They may also have images:

[image]

Not to mention [comment] comments on what is being talked about, special <formatting> formatting that might change the way the text looks </formatting>. Or a footnote [footnote] that references something below.

The modular types for this document could be something like the following:

  • [file]
    • [title]
    • [text]
    • [table]
    • [text]
    • [image]
    • [text]
      • [comment]
      • [formatting]
      • [footnote]

There are certain issues with this structure. For one, you need to have a way to identify the location of subcomponents: where are the comments located in the parent text component? But, assuming you could take care of this, the partnering Excel file that it is being compared to would look something like this (a code sketch of these trees follows below):

  • [file]
    • [sheet]
      • [table]
      • [unknown]
      • [unknown]
    • [sheet]
      • [table]
    • [sheet]
      • [table]
    • [sheet]
      • [table]
    • [sheet]
      • [table]
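
As referenced above, here is one hypothetical way to represent these trees in code. The location field is one possible answer to the “where are the comments located?” question - just an offset within the parent, an assumption rather than a design decision - and unknown components are kept in the tree rather than dropped, tying back to the “Unknown type” idea earlier:

```python
from dataclasses import dataclass, field

# Hypothetical node for the modular-type trees above.
@dataclass
class Node:
    kind: str            # "file", "sheet", "table", "text", "unknown", ...
    location: int = 0    # offset within the parent component
    raw_hint: str = ""   # best-effort description for unknown kinds
    children: list["Node"] = field(default_factory=list)

def list_unknowns(node: Node, path: str = "") -> list[str]:
    """Walk the tree and report every component that could not be parsed."""
    here = f"{path}/{node.kind}@{node.location}"
    found = [f"{here}: {node.raw_hint}"] if node.kind == "unknown" else []
    for child in node.children:
        found.extend(list_unknowns(child, here))
    return found

xlsx = Node("file", children=[
    Node("sheet", 0, children=[
        Node("table", 0),
        Node("unknown", 1, raw_hint="Comment 20146203 - 'I do not agree with this'"),
    ]),
    Node("sheet", 1, children=[Node("table", 0)]),
])
print("\n".join(list_unknowns(xlsx)))
# /file@0/sheet@0/unknown@1: Comment 20146203 - 'I do not agree with this'
```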

Does OpenDocument XML cover all of this? Probably partially. I think the answer is that I should borrow wherever possible from open formats (tabular, document, image, video, sound, sheet music, tablature, GIS, etc.), using converters that are maintained by others to convert from more proprietary formats into the open standard.
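
As a sketch of what this borrowing could look like in practice - assuming pandas (with openpyxl behind it for xlsx support) counts as one of those externally maintained converters - an xlsx sheet and a csv can both be reduced to the same tabular primary type before being compared. The file names are hypothetical:

```python
import pandas as pd  # third-party; assumes openpyxl is installed for xlsx

# Lean on converters maintained by others: both formats are reduced to
# the same primary tabular type, after which a single table diff runs.
def load_table(path: str) -> pd.DataFrame:
    if path.endswith(".xlsx"):
        return pd.read_excel(path)  # conversion maintained by others
    if path.endswith(".csv"):
        return pd.read_csv(path)
    raise ValueError(f"no tabular converter for {path}")

old = load_table("report_v1.xlsx")
new = load_table("report_v2.csv")
# pandas' cell-level comparison; assumes the two tables line up
print(old.compare(new))
```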