I recently began designing a generalized front-end to Fedora, with the goal of allowing domain experts to easily input, share and re-use digital content while at the same time allowing information specialists to design the metadata framework used for those digital objects. In other words, the system can be thought of as a digital asset management system for faculty, facilitated by librarians.
Within the system, digital objects are complex, composed of a set of metadata and one or more files of various types. A digital object can belong to more than one collection, and a basic set of Dublin Core (DC) metadata can be assumed for every digital object, along with one or more metadata profiles, or templates, that are defined per collection.
The metadata profiles will be designed by librarians and implemented with the help of developers, informed by the needs of the faculty populating the collection. These metadata profiles will be designed to accommodate an arbitrary number of fields of arbitrary types, so the structure of any collection’s metadata can’t be known at (system) design time, as specifying this structure is a function of the system itself. Designing the data model for this aspect of the system began as an interesting puzzle that, as I unraveled it, started to look like an increasingly elaborate version of the Entity-Attribute-Value (EAV) model, as shown in this partial ER diagram:
(I created this ER diagram in Open Office Impress, and the diamond arrow heads were the closest thing I could find to crow’s feet.)
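As a rough sketch of the kind of schema this implies, here is one way the entities described above might map to tables. Python with an in-memory SQLite database is used purely for illustration. Every table and column name other than digital_object is hypothetical, and a single dc_title column stands in for the full set of common DC fields.

```python
import sqlite3

# Hypothetical schema: only the digital_object table name comes from the post.
SCHEMA = """
CREATE TABLE digital_object (
    id       INTEGER PRIMARY KEY,
    dc_title TEXT NOT NULL              -- stand-in for the common DC metadata
);
CREATE TABLE collection (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
-- A digital object can belong to more than one collection.
CREATE TABLE collection_object (
    collection_id     INTEGER NOT NULL REFERENCES collection(id),
    digital_object_id INTEGER NOT NULL REFERENCES digital_object(id),
    PRIMARY KEY (collection_id, digital_object_id)
);
-- Profiles are defined per collection.
CREATE TABLE metadata_profile (
    id            INTEGER PRIMARY KEY,
    collection_id INTEGER NOT NULL REFERENCES collection(id),
    name          TEXT NOT NULL
);
-- The "attribute" part of EAV: one row per field in a profile.
CREATE TABLE metadata_field (
    id         INTEGER PRIMARY KEY,
    profile_id INTEGER NOT NULL REFERENCES metadata_profile(id),
    name       TEXT NOT NULL,
    data_type  TEXT NOT NULL            -- e.g. 'string', 'integer', 'date'
);
-- The "value" part of EAV: one row per object per field.
CREATE TABLE metadata_value (
    digital_object_id INTEGER NOT NULL REFERENCES digital_object(id),
    field_id          INTEGER NOT NULL REFERENCES metadata_field(id),
    value             TEXT
);
"""

def build_db():
    """Create an in-memory database holding the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn

conn = build_db()
conn.execute("INSERT INTO digital_object (id, dc_title) VALUES (1, 'Field notebook')")
conn.execute("INSERT INTO collection (id, name) VALUES (1, 'Geology')")
conn.execute("INSERT INTO collection_object VALUES (1, 1)")
conn.execute("INSERT INTO metadata_profile (id, collection_id, name) VALUES (1, 1, 'Specimen')")
conn.execute(
    "INSERT INTO metadata_field (id, profile_id, name, data_type) "
    "VALUES (1, 1, 'mineral', 'string')"
)
conn.execute("INSERT INTO metadata_value VALUES (1, 1, 'quartz')")
```

The point of the sketch is that adding a new field to a profile becomes an insert into metadata_field rather than a schema change.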
EAV models are a popular method of storing what are essentially key-value pairs, and the approach is commonly used in applications for storing global configuration information. I have especially seen this technique widely used in PHP applications, where it is implemented as a single table or as a few fields within a larger, existing table.
There are a number of disadvantages to modeling data in this way, and essentially it boils down to storing data in a way that is not optimized for a relational database. Therefore, you will have to code and maintain aspects of the data normally handled automatically by the database.
For example, data integrity issues have to be thought through. In my case, changes to value choices may involve either updating or deleting many value records manually. And to be sure I can accommodate arbitrary values of any type, I will likely define values as a long varchar field, which will mean that I can store large amounts of text, but also that the number 1 would be stored as the string ‘1’, and I would have to handle its type information carefully.
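To make the type-handling concern concrete, here is a minimal sketch of one way to do it: store every value as text and keep a type tag alongside the field definition, then convert on the way in and out. It is written in Python for illustration only; the tag names and the encode/decode helpers are hypothetical, not part of any existing library.

```python
from datetime import date

# Hypothetical type tags, each mapped to a function that parses the stored text.
DECODERS = {
    "string": str,
    "integer": int,
    "float": float,
    "boolean": lambda raw: raw == "true",
    "date": date.fromisoformat,
}

def encode(value):
    """Turn a Python value into the text stored in the value column."""
    if isinstance(value, bool):          # check bool before anything numeric
        return "true" if value else "false"
    if isinstance(value, date):
        return value.isoformat()
    return str(value)

def decode(raw, data_type):
    """Turn stored text back into a typed value, using the field's type tag."""
    try:
        decoder = DECODERS[data_type]
    except KeyError:
        raise ValueError(f"unknown data type: {data_type!r}")
    return decoder(raw)
```

With this in place, `decode("1", "integer")` yields the number 1 while `decode("1", "string")` yields the text "1", so the ambiguity is resolved by the field definition rather than by the stored value.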
Querying is also not as straightforward, and inserts and updates will be slower, as what would normally be one database record now consists of an arbitrary number of records. But I am hopeful that careful planning and clever user interface design, especially an intelligent use of AJAX, can mitigate the performance issues and hide the underlying complexity of this approach from the user.
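The cost described here can be seen in a small sketch: saving one object's metadata becomes several inserts that should succeed or fail together, and reading it back means joining and regrouping rows. The example uses Python and SQLite for illustration, and the table names, column names and helper functions are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE metadata_field (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE metadata_value (
    digital_object_id INTEGER NOT NULL,
    field_id          INTEGER NOT NULL REFERENCES metadata_field(id),
    value             TEXT
);
""")
conn.executemany(
    "INSERT INTO metadata_field (id, name) VALUES (?, ?)",
    [(1, "mineral"), (2, "locality")],
)

def save_metadata(conn, object_id, values_by_field_id):
    """Replace all of an object's values in one transaction."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute(
            "DELETE FROM metadata_value WHERE digital_object_id = ?",
            (object_id,),
        )
        conn.executemany(
            "INSERT INTO metadata_value (digital_object_id, field_id, value) "
            "VALUES (?, ?, ?)",
            [(object_id, fid, val) for fid, val in values_by_field_id.items()],
        )

def load_metadata(conn, object_id):
    """Reassemble one logical record from its many value rows."""
    rows = conn.execute(
        "SELECT f.name, v.value FROM metadata_value v "
        "JOIN metadata_field f ON f.id = v.field_id "
        "WHERE v.digital_object_id = ? ORDER BY f.id",
        (object_id,),
    )
    return dict(rows.fetchall())

save_metadata(conn, 1, {1: "quartz", 2: "Arkansas"})
```

Calling `load_metadata(conn, 1)` then returns a single dictionary keyed by field name, which is the shape the rest of the application would want to work with.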
And it’s not as if using relational databases in ways they weren’t intended is anything new. We saw this when object-oriented concepts and full text indexing were added to databases, and again when XML and spatial data capabilities were added. In fact, other alternatives I considered included storing the values as XML data within the digital_object table in a single long varchar field for each record, or perhaps even storing serialized objects in a BLOB field in the digital_object table. However, these approaches seem to have many of the same limitations as the EAV approach, and may in fact make the data values more difficult to retrieve and manipulate, especially in a relational database that does not natively support an XML data type, for example. So, although I’m aware of the limitations of EAV, and I’m uneasy about certain aspects of it from a design standpoint, it seems to be the best option for handling my requirements, in which managing the digital object metadata will likely be the largest core feature of this application.
Many of the limitations of EAV may also not directly apply in my application, which will be primarily focused on managing one digital object at a time. Browsing is likely to be based on things like collections or tags, in which case only the common DC metadata would be retrieved from the database for display. I was already anticipating using a full text search engine like Lucene anyway, so any of the above difficulties not already addressed should be less relevant.
The current anticipated deployment will be into JRuby on Rails (or would that be Rails on JRuby?), so somewhat ironically, I expect that the dynamically typed nature of the language will help me deal with different data types more efficiently. I also suspect that I can find a concise way to handle validation in the code.
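As a sketch of what concise validation might look like in a dynamically typed language, the example below checks a submitted form against a profile definition. It is in Python rather than Ruby, purely for illustration, and the PROFILE structure, field names and error messages are all hypothetical.

```python
# Hypothetical profile definition: field name -> (converter, required?)
PROFILE = {
    "mineral": (str, True),
    "sample_count": (int, False),
}

def validate(submitted, profile):
    """Return a dict of field name -> error message; empty means valid."""
    errors = {}
    for name, (convert, required) in profile.items():
        raw = submitted.get(name, "")
        if raw == "":
            if required:
                errors[name] = "is required"
            continue
        try:
            convert(raw)                 # the converter doubles as the type check
        except ValueError:
            errors[name] = f"is not a valid {convert.__name__}"
    # Reject fields the profile does not define.
    for name in submitted:
        if name not in profile:
            errors[name] = "is not part of this profile"
    return errors
```

Because the converters are ordinary callables, supporting a new field type only means adding an entry to the profile definition, not writing a new validation branch.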
It’s interesting to note that there are two Rails plugins that appear to implement the EAV model: Acts as Configurable and Flex Attributes. These take a much simpler approach to EAV and seem to be designed to handle the common situation I mentioned above of storing global configuration data in the database, although my current limited knowledge of Ruby and Rails prevents me from fairly evaluating either of them.
Ultimately, this design is likely to continue evolving in unexpected ways. I would be interested in people’s reactions to using the EAV approach in the context I have described, especially if those people have implemented similar features in applications in the past. I would be especially interested in hearing about another, hopefully better design alternative that would still meet the basic requirements I have described above.