Why Graph?

The bulk of my experience with application development has been building workflow-related rich client and web applications on NoSQL databases, typically IBM Domino. The challenge in the Notes Client was to provide dashboard-style displays and a good way to display documents for action by the current individual. Private views can be used, but they impact database performance. So, typically, the approach is to display views that present a scrollable table of data. Domino's document-level reader security is then used to ensure only the appropriate data is visible. If data is archived appropriately, performance of the database is good enough for many reasonably-sized applications. (Of course, archiving is often omitted from the scope of the first phase of a rapidly-developed application, and becomes a case of "out of sight, out of mind".) But as web applications increasingly replaced Notes Client applications, displaying "my documents" and using structured searches to show a targeted subset of documents became much easier.

NoSQL datastores with a document-driven API are something I'm very familiar with, and I'm equally familiar with the criticism that they lead to content being duplicated across different document types, which can cause inconsistency. For all the criticisms, the flexibility of schema has made NoSQL popular enough that relational databases could not ignore it, resulting in the rise of NewSQL databases.

But another approach has gained ground in recent years, particularly for large datasets: graph databases. Over the last couple of years I've been introduced to graph databases via the Apache Tinkerpop API and an open source implementation for IBM Domino by Nathan T. Freeman in the OpenNTF Domino API. As a co-developer of the project I was an early adopter, and graph became an attractive way of building the data architecture.

The approach of vertices and edges works nicely for workflow-related applications. Storing approvals on the document being approved was a typical approach for document-driven databases, but an alternative was having separate "child" documents for each approval. Separate child documents give greater flexibility and scalability (what if an additional level is needed or a level removed?). But approvals have to be related both to the object being approved and to the individual approving. Approval then requires access via each approval document (e.g. InvoiceApproval), but overall review requires access via the document type being approved (e.g. Invoice). Graph gives that flexibility, by having the InvoiceApproval as a vertex connected via separate edges to both the Invoice and the Person. Alternatively, RequiresApproval, Approved or Rejected could be different edges between the Person and the Invoice. If the edge can contain properties of its own, a submission can generate a RequiresApproval edge, which is replaced by an Approved or Rejected edge as appropriate, with dates and comments stored as properties of the edge.
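To make the edge-based variant concrete, here is a minimal sketch in plain Java. It does not use Tinkerpop or the OpenNTF Domino API; the `Vertex`, `Edge` and `ApprovalGraph` classes, their methods and the `date` and `comment` property names are all hypothetical, chosen only to illustrate a RequiresApproval edge being replaced by an Approved or Rejected edge that carries its own properties.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical minimal vertex: a type (Person, Invoice) plus an id. */
final class Vertex {
    final String type;
    final String id;

    Vertex(String type, String id) {
        this.type = type;
        this.id = id;
    }
}

/** Hypothetical minimal edge: a label, two endpoints and its own properties. */
final class Edge {
    final String label;
    final Vertex person;
    final Vertex invoice;
    final Map<String, Object> properties = new HashMap<>();

    Edge(String label, Vertex person, Vertex invoice) {
        this.label = label;
        this.person = person;
        this.invoice = invoice;
    }
}

/** Illustrative in-memory graph holding only Person-to-Invoice edges. */
final class ApprovalGraph {
    private final List<Edge> edges = new ArrayList<>();

    /** Submission creates a RequiresApproval edge between approver and invoice. */
    void submit(Vertex person, Vertex invoice) {
        edges.add(new Edge("RequiresApproval", person, invoice));
    }

    /** A decision replaces the pending edge; date and comment live on the new edge. */
    Edge decide(Vertex person, Vertex invoice, boolean approved, String date, String comment) {
        boolean wasPending = edges.removeIf(e -> e.label.equals("RequiresApproval")
                && e.person == person && e.invoice == invoice);
        if (!wasPending) {
            throw new IllegalStateException("No pending approval for " + person.id + " on " + invoice.id);
        }
        Edge decision = new Edge(approved ? "Approved" : "Rejected", person, invoice);
        decision.properties.put("date", date);
        decision.properties.put("comment", comment);
        edges.add(decision);
        return decision;
    }

    /** The "my documents" view: invoices this person still has to act on. */
    List<Vertex> pendingFor(Vertex person) {
        List<Vertex> result = new ArrayList<>();
        for (Edge e : edges) {
            if (e.label.equals("RequiresApproval") && e.person == person) {
                result.add(e.invoice);
            }
        }
        return result;
    }

    /** The overall review: every edge attached to one invoice, whatever its label. */
    List<Edge> historyOf(Vertex invoice) {
        List<Edge> result = new ArrayList<>();
        for (Edge e : edges) {
            if (e.invoice == invoice) {
                result.add(e);
            }
        }
        return result;
    }
}
```

The same edges serve both access paths: `pendingFor` starts from the Person, while `historyOf` starts from the Invoice. A real graph database would index these traversals rather than scan a list as this sketch does.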

Similarly, graph databases fit nicely for social data, like comments on a blog post. The graph approach allows the Comment vertex to have edges to the BlogPost, to the Author and to anyone Mentioned.

Although it's not something I've implemented, this multi-directional approach could also add benefits to traditionally hierarchical systems like CRM applications. Person A works for Company X at Location N. The hierarchy of Company-Location-Person worked nicely when everyone worked at a main office. But what if they work from home and don't want your system to have their home address? Do you store them against a dummy "Home" Location? Or against an office they sometimes go to? If so, do you add some comment to say that they actually work from home? Then there's the scenario of Person A leaving to work for Company Y, who you also work with. What happens to their interactions with you when they were at Company X? Yes, they're relevant for whoever takes over their position at Company X, but they may also be relevant to the business you want to do with them at Company Y. Do you duplicate the interactions? Graph would allow them to be in both places, all interactions connected to Person A, the interactions prior to their move connected to Company X and the ones afterwards connected to Company Y.
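As a rough illustration of the "both places" idea, the sketch below connects each interaction to a person and to whichever company that person worked for at the time. It is plain Java rather than a graph API, and the `CrmGraph` class, its methods and the id strings are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: interactions linked to any number of other vertices. */
final class CrmGraph {
    // interaction id -> ids of the vertices it has an edge to
    private final Map<String, Set<String>> links = new LinkedHashMap<>();

    /** Records an interaction and connects it to each given vertex id. */
    void addInteraction(String interactionId, String... connectedTo) {
        links.computeIfAbsent(interactionId, k -> new LinkedHashSet<>())
             .addAll(Arrays.asList(connectedTo));
    }

    /** Returns interactions connected to a vertex, in the order they were added. */
    List<String> interactionsFor(String vertexId) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : links.entrySet()) {
            if (entry.getValue().contains(vertexId)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

If a call is added with links to `person:A` and `company:X`, and a later meeting with links to `person:A` and `company:Y`, then querying `person:A` returns both interactions while querying `company:X` returns only the call. Nothing is duplicated; each interaction simply has more than one edge.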

Once you understand how to model your data in this vertex-edge structure, it becomes quite easy to construct the architecture. But the strength of the OpenNTF Domino API implementation is two-fold.

Firstly, it was built prior to Tinkerpop 3 and so uses framed graphs. This means vertices and edges can be converted to Java objects by just defining a Java interface. The core code manages the getters and setters and other methods. This speeds up development quite a bit and also removes the need to deal directly with vertices and edges, unless you wish to.
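The general idea behind framing can be sketched with the JDK's own dynamic proxies. This is not the Tinkerpop Frames library or the OpenNTF Domino API code; `MiniFramer`, the `Invoice` interface and the use of a plain `Map` to stand in for a vertex are assumptions made purely for illustration.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.Map;

/** Hypothetical frame: only an interface is declared, with no implementing class. */
interface Invoice {
    String getNumber();
    void setNumber(String number);
    Double getAmount();
    void setAmount(Double amount);
}

final class MiniFramer {
    /** Wraps a property map (standing in for a vertex) in the given interface. */
    static <T> T frame(Map<String, Object> vertexProperties, Class<T> kind) {
        InvocationHandler handler = (proxy, method, args) -> {
            String name = method.getName();
            // Give the proxy sane identity semantics for equals, hashCode and toString.
            if (method.getDeclaringClass() == Object.class) {
                switch (name) {
                    case "equals":
                        return proxy == args[0];
                    case "hashCode":
                        return System.identityHashCode(proxy);
                    default:
                        return kind.getSimpleName() + vertexProperties;
                }
            }
            // getXyz() reads the property "xyz" from the underlying map.
            if (name.startsWith("get") && name.length() > 3 && args == null) {
                return vertexProperties.get(propertyName(name));
            }
            // setXyz(value) writes the property "xyz" to the underlying map.
            if (name.startsWith("set") && name.length() > 3 && args != null && args.length == 1) {
                vertexProperties.put(propertyName(name), args[0]);
                return null;
            }
            throw new UnsupportedOperationException(name);
        };
        Object framed = Proxy.newProxyInstance(kind.getClassLoader(), new Class<?>[] { kind }, handler);
        return kind.cast(framed);
    }

    /** Turns "getNumber" or "setNumber" into the property name "number". */
    private static String propertyName(String accessor) {
        return Character.toLowerCase(accessor.charAt(3)) + accessor.substring(4);
    }
}
```

Calling `MiniFramer.frame(properties, Invoice.class)` returns an `Invoice` whose setters write to the map and whose getters read from it, without anyone writing an implementation class. The real frameworks add much more on top, such as traversing edges to related objects.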

Secondly, because it's built on Domino, the database is multi-model. Because Domino is NoSQL, the content is accessible as Documents. At the same time, because the OpenNTF Domino API makes the Document class extend Map, the content is also accessible by querying the documents as Java Maps. And because it's using Tinkerpop, you can build the databases using a Graph API. Then, by using Proxy Vertices, existing Domino documents can be extended into the Graph API by adding a "wrapper document": the graph wrapper document is queried for properties and, if they're not found there, the core document is queried. This is particularly relevant where the core document should not be modified, for example when dealing with the Person documents managed by the core Domino server and administration processes. This gives a wealth of flexibility for creating applications using a NoSQL database design or a graph database design, or even combining the two.
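The wrapper-then-core lookup can be sketched in a few lines. This is only an illustration of the fallback behaviour described above, not the OpenNTF Domino API implementation; `ProxyVertexSketch` and its methods are hypothetical, and plain Maps stand in for both documents.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: graph properties live in a wrapper, the core is read-only. */
final class ProxyVertexSketch {
    private final Map<String, Object> wrapper = new HashMap<>();
    private final Map<String, Object> core;

    ProxyVertexSketch(Map<String, Object> coreDocument) {
        // An unmodifiable view guards against this class writing to the core by accident.
        this.core = Collections.unmodifiableMap(coreDocument);
    }

    /** Looks in the wrapper first and falls back to the core document. */
    Object getProperty(String key) {
        return wrapper.containsKey(key) ? wrapper.get(key) : core.get(key);
    }

    /** Writes always go to the wrapper, so the core document is never modified. */
    void setProperty(String key, Object value) {
        wrapper.put(key, value);
    }
}
```

In this sketch a value set on the wrapper shadows a core value with the same key. Whether the real implementation behaves that way for overlapping keys is an assumption here, not something the description above confirms.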

As a result, when it comes to re-evaluating what database backend to use going forward, graph is my preferred approach, particularly one based on Tinkerpop, with which I've become familiar. The next steps from there have become a little more challenging though.