This is a sketch of how some of the ideas discussed in yesterday's core conversations configuration management sessions might look in practice. Obviously this is just one idea, and there are plenty of other valid approaches (overall or for any specific piece) - we don't want to get too specific yet. For example, some of what I describe could probably live happily in contrib. My big picture point is that the approaches discussed are not mutually exclusive, and in fact we probably need bits of each of them.

As Greg Dunlap proposed, let's say we added a system for assigning reliably unique IDs to "everything" (within reason, but I think content should be included here). These unique IDs should be treated as the canonical IDs. In many cases this could look like the machine name field - sometimes it would be hidden from the user, and sometimes there would be other methods of determining the ID (manually or programmatically).

We still create and store numeric IDs for data while it is in the database, but we store them alongside the canonical ID. The numeric ID is used for primary keys and most joins in the database (for performance reasons), but the canonical ID is king wherever the data exists in any context outside of the database. A layer carefully maps between the numeric IDs and the canonical IDs (including references) on CRUD operations.
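
As a rough illustration, the mapping layer could be sketched like this in Python, using SQLite as a stand-in database. The table layout and the function names (open_store, save, numeric_id, export) are made up for this example and are not an existing Drupal API; the only point is that the numeric ID stays inside the database while everything outside it sees the canonical ID.

```python
import sqlite3


def open_store():
    """Create an in-memory table that holds both kinds of ID (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """
        CREATE TABLE item (
            nid INTEGER PRIMARY KEY AUTOINCREMENT,  -- local numeric ID, used for joins
            canonical_id TEXT NOT NULL UNIQUE,      -- portable ID, used outside the database
            title TEXT NOT NULL
        )
        """
    )
    return db


def save(db, canonical_id, title):
    """Create or update an item by canonical ID; callers never handle the numeric ID."""
    row = db.execute(
        "SELECT nid FROM item WHERE canonical_id = ?", (canonical_id,)
    ).fetchone()
    if row is None:
        db.execute(
            "INSERT INTO item (canonical_id, title) VALUES (?, ?)",
            (canonical_id, title),
        )
        return "created"
    db.execute("UPDATE item SET title = ? WHERE nid = ?", (title, row[0]))
    return "updated"


def numeric_id(db, canonical_id):
    """Look up the site-local numeric ID (for internal joins only), or None."""
    row = db.execute(
        "SELECT nid FROM item WHERE canonical_id = ?", (canonical_id,)
    ).fetchone()
    return row[0] if row else None


def export(db, canonical_id):
    """Representation used outside the database: no numeric ID appears in it."""
    row = db.execute(
        "SELECT canonical_id, title FROM item WHERE canonical_id = ?",
        (canonical_id,),
    ).fetchone()
    return {"id": row[0], "title": row[1]} if row else None
```

Two sites that each call save() with the same canonical ID will usually end up with different numeric IDs for that item, but export() returns the same data on both, which is what makes the data portable.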

What does this mean so far:

  • It allows us to track both the data, and the references to it, across multiple sites or site instances - we no longer run into the auto-increment ID problem.
  • Numeric IDs no longer need to appear in the user interface. We can use the canonical IDs in the default system paths, which makes the UI more consistent and also keeps URLs consistent and robust across sites (otherwise links to /node/123 URLs would break any time content is migrated). Of course we would still need some kind of pathauto system that adds more structured aliases reflecting the information architecture, but this would be a layer on top of the system paths.
  • Numeric IDs no longer need to appear in our web-facing APIs - a good RESTful interface should not concern the client with details of the internal implementation. Right now we are pushing the tracking/mapping of IDs across sites out to the API clients, which makes a good number of feeds/deploy type systems highly complex and fragile any time references are pushed.
  • With canonical IDs that match the system paths you could do a JSON GET on the path on one site, then a PUT of the same data to the same system path on a different site, and the system would automatically know if this was a create or update operation based on the ID. This means that the deploy or feeds module use cases could feasibly be handled by a system completely outside of Drupal, which I think is a great benefit in some cases - for example systems (which may be non-Drupal based) responding to external events can make CRUD operations on a site but can also do easy push/pulls of data from one site to another.
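
The GET-then-PUT idea in the last point can be sketched without any real HTTP. In this minimal Python example the Site class is a hypothetical stand-in for a site's web API, and the push function plays the role of an external deploy tool; neither is an existing interface. The status codes follow the usual HTTP convention for PUT (201 when the resource is created, 200 when it is replaced).

```python
import json


class Site:
    """Stand-in for one site's web API; a real one would speak HTTP."""

    def __init__(self):
        self._items = {}  # system path (the canonical ID) -> stored data

    def get(self, path):
        """Return the JSON representation stored at a system path."""
        return json.dumps(self._items[path], sort_keys=True)

    def put(self, path, body):
        """Store JSON at a system path; the path alone decides create vs update."""
        status = 200 if path in self._items else 201
        self._items[path] = json.loads(body)
        return status


def push(source, target, path):
    """External deploy step: read from one site, write to the same path on another."""
    return target.put(path, source.get(path))
```

For example, after staging.put("node/about-us", '{"title": "About us"}'), the first push(staging, live, "node/about-us") returns 201 and a repeated push returns 200, without the caller tracking any IDs itself.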

So in addition to this, let's say we have a hierarchy of well-structured, diffable JSON (pluggable, etc) files. These can represent a canonical, declarative and optional representation of data in the system. Hopefully it is obvious that the canonical unique IDs could greatly simplify the implementation here, as well as making diffs and tools that work with this representation of the data much simpler.
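
To make "diffable" a little more concrete, here is a minimal Python sketch. It assumes that each item is written with sorted keys, fixed indentation and one value per line, so that the same data always produces the same bytes; the serialize and diff helpers and the example field names are invented for illustration.

```python
import difflib
import json


def serialize(data):
    """Deterministic output: sorted keys, fixed indent, trailing newline."""
    return json.dumps(data, indent=2, sort_keys=True, ensure_ascii=False) + "\n"


def diff(old, new, name):
    """Unified diff between two versions of the same item."""
    return "".join(
        difflib.unified_diff(
            serialize(old).splitlines(keepends=True),
            serialize(new).splitlines(keepends=True),
            fromfile=f"a/{name}",
            tofile=f"b/{name}",
        )
    )
```

With this in place, changing a single setting (say a hypothetical items_per_page value from 10 to 25) shows up as one removed line and one added line, and two sites that hold the same data produce identical files regardless of the order in which the keys were stored.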

There are a few things (several of which David Strauss hinted at) that I think are important to consider about how this would most likely work:

  • The "cache" of this data that is pulled into the database on refresh is only a cache in the sense that it is not the canonical source of this data. It is not implied that these caches need to be in the form of blobs of serialized php in the cache table - in many cases it will be necessary to represent these in more structured tables that can be queried in useful ways.
  • Also, these caches should never expire individually - it is critical that refreshing the caches is an atomic operation. Hence the entire set of files needs to be validated to ensure it can be handled by the current code base, and all the changes need to be made at the same time. If caches are updated in a more ad-hoc way it becomes much harder to manage the ID mapping (even with canonical IDs) in a way that ensures the end result will function - this is especially the case with complex configuration-oriented elements. Refreshing the caches would likely be a user-initiated operation (possibly via a deployment system) in most cases, because the refresh will need to be timed with a source code update that references the changes.
  • While it is a big step, I think it should be possible (at least architecturally) to represent any data in this form, even stuff that is clearly content, such as nodes and users. This does raise some issues of course, such as what happens if content stored in the files references content that is not (my guess is that we recursively store referenced content too). But this is an issue that will occur even when storing only configuration - e.g. views or even the front-page variable can reference tids and nids etc, and even switching to canonical IDs doesn't ensure those elements exist on that site.
  • Another potential concern is that it would be horrible for large sites to have to store all content in this way, so it does need to be optional - I imagine a system of excludes (and probably most entities would be excluded by default) as well as perhaps an "opt in" for specific items.
  • This does imply some UI changes which will need careful thought - while I think a features style update/revert pattern will make sense for some sites, for other sites you may want to lock items (read only files, perhaps), and of course you have the requirements for bundles, multiple levels of overrides (code < distro < site etc). Even if much of the UI for this may be in contrib, I am pretty sure there will need to be some changes in core to allow users to intuitively control and understand this process.
  • I think it may be worth considering adding some level of self-descriptive structure to the CM storage format (JSON, YAML etc). This would make it much easier to build tools that work with these files directly - for example, tools that make the files easier to edit by hand, or that provide more contextual diffs.
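
The atomic refresh described in the list above could look roughly like the following Python sketch: validate the whole set of files first, then replace the cached copy inside a single database transaction. SQLite stands in for the site database, the files are passed as a name-to-text mapping to keep the example self-contained, and SUPPORTED_TYPES, open_cache, validate and refresh are all hypothetical names rather than an existing API.

```python
import json
import sqlite3

# Hypothetical: the item types the currently deployed code base can handle.
SUPPORTED_TYPES = {"view", "variable"}


def open_cache():
    """Create the structured cache table (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE config_cache ("
        " id TEXT PRIMARY KEY, type TEXT NOT NULL, data TEXT NOT NULL)"
    )
    return db


def validate(files):
    """Check every file in the set; return a list of problems (empty means OK)."""
    problems = []
    for name, text in sorted(files.items()):
        try:
            item = json.loads(text)
        except json.JSONDecodeError as err:
            problems.append(f"{name}: invalid JSON ({err.msg})")
            continue
        if not isinstance(item, dict):
            problems.append(f"{name}: expected a JSON object")
            continue
        if not item.get("id"):
            problems.append(f"{name}: missing canonical id")
        if item.get("type") not in SUPPORTED_TYPES:
            problems.append(f"{name}: unsupported type {item.get('type')!r}")
    return problems


def refresh(db, files):
    """Replace the whole cache in one transaction, or leave it untouched."""
    problems = validate(files)
    if problems:
        raise ValueError("; ".join(problems))
    items = [json.loads(text) for text in files.values()]
    rows = [(i["id"], i["type"], json.dumps(i, sort_keys=True)) for i in items]
    with db:  # commits on success, rolls back if any statement fails
        db.execute("DELETE FROM config_cache")
        db.executemany(
            "INSERT INTO config_cache (id, type, data) VALUES (?, ?, ?)", rows
        )
```

If a single file in the set fails validation, refresh() raises before touching the database, so the site keeps running on the previous, internally consistent set. A failure part-way through the writes rolls the whole transaction back for the same reason.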

I think we made some real progress around this issue yesterday, more than I have felt at previous DrupalCons - and I am excited to see where this takes us in Drupal 8. Let's keep moving the conversation forward!

Owen Barton is Director of Engineering at CivicActions. He has been developing elegant solutions in Drupal for over 12 years and is widely credited with building one of the most reputable and experienced Drupal engineering teams on the planet.