Friday, April 29, 2011

Database design for text revisions

This question contains some excellent coverage of how to design a database history/revision scheme for data like numbers or multiple choice fields.

However, there is not much discussion of large text fields, as often found in blog/Q&A/wiki/document type systems.

So, what would be considered good practice for storing the history of a text field in a database based editing system? Is storing it in the database even a good idea?

From stackoverflow
  • I develop a wiki engine and page/article revisions are stored in a database table. Each revision has a sequential revision number, while the "current" revision is marked with -1 (just to avoid NULL).

    Revision text is stored as-is, not diffed or something like that.

    I think that performance is not a problem because you are not likely to access older revisions very frequently.

    dr. evil : I have done the same thing, so far works fine for me. The system is around for 2 years now.
  • Often the most sensible way of tracking the versions of a document is to keep track of the changes made to it. Then, if a particular version is requested it can be rebuilt from the current document and the partial set of changes.

    So if you have a good method of describing the types of changes to a document (this will depend largely on what the document is and how it used) then by all means use a database to track the changes and therefore the versions.

    Dario Solera : Storing only the differences between revisions has the advantage of saving space, but it slows the process of retrieving/rebuilding a specific revision and it also increases the chances of data corruption (what if a revision is lost?).
    CurtainDog : Yes, those are valid points. The optimal approach is somewhere in between and I imagine would be fairly sensitive to the type data I think. In the case of somebody correcting the spelling of a word or two, it seems like overkill to store the entire document again.
    Mocky : @Dario I agree, and upvoted, your answer to this question, but I do not agree with your comment here. Fear of "losing" data from your database is not good reasoning and slowing the process of retrieving a specific revision is less of an issue, as you yourself point out, because you are not likely to access older versions frequently.
  • Given the current state of HDD art, it just does not worth the effort trying to optimize text storage mechanisms: Document (ID, Name) and DocumentRevision (ID, DocumentID, Contents) tables will do the job. the ID in DocumentRevision may also serve as a "repository"-wide revision number. If this is not the behavior you want, assign a separate VersionID to each Document Revision.

0 comments:

Post a Comment