When I want to learn about a topic, I like to go through at least two books to get different perspectives and hopefully get a broader and more thorough understanding of the topic. This time I was not disappointed, and I picked two dramatically different books that worked quite well together to cover the vast subject of databases. Database Design for Mere Mortals is a book about how to design a database in a simple, straightforward way. Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement is exactly what it sounds like, a book that runs through the features, advantages, and disadvantages of seven different modern databases to give you an overview of what's currently available and what these databases are capable of. Let's take a closer look at what each of these books has to offer.
Database Design for Mere Mortals
Databases are the foundation of many types of software applications, and I wanted to get a working understanding of how to design a database well. Michael Hernandez does an excellent job in this book laying out a process for designing a database with good data integrity and consistency, two important goals when designing a database. His writing style is direct, clear, and pleasant, and I found this book surprisingly easy to read considering it's covering a topic that threatens to be utterly dry and boring.
The book is split into three parts: an overview of relational database design, the details of the design process, and some other database design issues. The first part covers the history of database models, what a relational database is, the design objectives for a database design process, and the terminology of relational databases. The relational database that we know and love was conceived all the way back in 1970 in more or less its current form with tables, fields, and relationships. Like with so many of the computer innovations of the 60s and 70s, we are standing on the shoulders of giants today with relational databases.
The second part of the book contains the majority of the content and lays out the details of the database design process. The process starts with defining the goals and objectives of the database and analyzing the current database that's in use, if there is one. This step, as well as most of the other steps in the process, involves numerous interviews with various stakeholders of the database, and example interviews are spread throughout this part of the book. The process continues with establishing the table structure of the database, defining the different types of keys for the tables, writing field specifications, designing table relationships, establishing business rules, defining views to represent database queries, and reviewing data integrity.
The third part of the book wraps up with some bad design practices to watch out for and advice for when you might think about bending or breaking the rules of good database design. Throughout the book, Hernandez focuses on the relational database and doesn't even bring up other modern database models, like the models covered in Seven Databases in Seven Weeks, but the process he lays out could be adapted to other types of databases fairly easily.
Early in the book Hernandez takes a decidedly database-centric view of application design by assuming the database is the starting point of an application, and the data structure should be established first with the rest of the application being built up around it. It's worth noting that this is but one way to design an application, and it normally results in an application with a data-centric feel to it. Another way to design an application is to start with the user interface (UI) and user experience (UX) and design the database to work with the application structure that results. Such an application will have a much more user-centric look and feel. Both approaches have their own benefits and issues, and of course a real project will likely be a hybrid of these two extremes because the design of the database and user interface is not going to be a serial process.
As for the database design process itself, I was surprised at how similar it is to good software design. Much of good database design boils down to eliminating duplication in the database, choosing descriptive names for everything in the database, and making the database flexible enough to change easily. Good software design will follow these same principles. Most of the book felt very familiar except that the terminology referred to databases instead of programming languages.
The reason that duplication in a database is a bad thing is that if the same field is replicated in multiple tables, then the data for those fields needs to be constantly kept in sync. It becomes much more likely that the data will be updated in one place and not the other, causing data integrity problems. The process of eliminating duplication is referred to as normalizing the database, but using Hernandez' design methodology, duplication is never allowed into the database, so normalization shouldn't be necessary. The equivalent principle in software design is the DRY principle, and it is usually achieved through refactoring.
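To make the sync problem concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (orders_dup, customers, orders) and the data are my own made-up illustration, not an example from the book. It compares a design that copies a customer's city into every order with one that stores the city in a single place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Duplicated design (hypothetical): the city is copied into every order.
    CREATE TABLE orders_dup (order_id INTEGER PRIMARY KEY, customer TEXT, city TEXT);
    INSERT INTO orders_dup VALUES (1, 'Ada', 'London'), (2, 'Ada', 'London');

    -- Single-source design (hypothetical): the city lives only in customers.
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id)
    );
    INSERT INTO customers VALUES (1, 'Ada', 'London');
    INSERT INTO orders VALUES (1, 1), (2, 1);
""")

# An update that reaches only one of the duplicated rows leaves the data inconsistent.
conn.execute("UPDATE orders_dup SET city = 'Paris' WHERE order_id = 1")
dup_cities = {row[0] for row in conn.execute(
    "SELECT city FROM orders_dup WHERE customer = 'Ada'")}

# With a single source, one update is seen by every order that joins to it.
conn.execute("UPDATE customers SET city = 'Paris' WHERE customer_id = 1")
single_cities = {row[0] for row in conn.execute(
    "SELECT c.city FROM orders AS o JOIN customers AS c ON c.customer_id = o.customer_id")}

print(dup_cities)     # two different cities for the same customer
print(single_cities)  # exactly one city
```

The duplicated table ends up claiming that the same customer lives in two cities, while the single-source design cannot get into that state.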
Naming is one of the hardest problems in software design, and it seems from this book, the same is true of database design. At least half of the book is about how to name various database structures, and many times naming a table or field will provide clues as to how it should be changed to make the design better. One place where naming can reveal design issues is when a table has a multi-part field. A multi-part field is a field with values that have more than one distinct subject. The canonical example is an address field that contains full addresses as string values. An address consists of a street name, city, state, and zip code. A better design would have separate fields for each of these parts of the address.
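As a rough illustration of why the multi-part field hurts, here is a small sqlite3 sketch. The table names, column names, and addresses are all invented for the example. It asks the same question, "who lives in Springfield?", against both designs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Multi-part field (hypothetical): the whole address is one string.
    CREATE TABLE customers_multipart (name TEXT, address TEXT);
    INSERT INTO customers_multipart VALUES
        ('Ada', '12 Elm St, Springfield, IL 62704'),
        ('Bob', '9 Springfield Ave, Portland, OR 97201');

    -- Separate fields (hypothetical): one field per part of the address.
    CREATE TABLE customers_split (name TEXT, street TEXT, city TEXT, state TEXT, zip TEXT);
    INSERT INTO customers_split VALUES
        ('Ada', '12 Elm St', 'Springfield', 'IL', '62704'),
        ('Bob', '9 Springfield Ave', 'Portland', 'OR', '97201');
""")

# Searching the multi-part field needs a substring match, which also hits the street name.
fuzzy = [row[0] for row in conn.execute(
    "SELECT name FROM customers_multipart WHERE address LIKE '%Springfield%' ORDER BY name")]

# With separate fields, the same question is an exact comparison on one field.
exact = [row[0] for row in conn.execute(
    "SELECT name FROM customers_split WHERE city = 'Springfield' ORDER BY name")]

print(fuzzy)  # includes Bob, who only lives on Springfield Ave
print(exact)  # only Ada
```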
While the advice on naming was generally good, and I learned quite a lot about subtle naming issues in database design, some of the advice struck me as odd. Hernandez recommends using an abbreviated form of the table name as a prefix to disambiguate field names that would otherwise be the same across tables, like when there are address fields for employees, customers, and suppliers in their respective tables. He advises having names like "EmpAddress," "CustAddress," and "SuppAddress." Later he abbreviates things like "Classes" to "Cls" and "Student" to "Std," which makes even less sense to me. I would think this practice adds confusion and mental effort, since you have to do the translation in your head every time you use these abbreviated fields. I've come to appreciate the advantages of not abbreviating names in programming, and the cost of a little extra typing is not terribly significant. I would prefer the unabbreviated names for clarity.
Another issue I had with this design process was how much it seemed like a Big Up-Front Design (BUFD) process. The entire database design is decided in the beginning through meetings and interviews, and Hernandez recommends everything be documented in paper folders. And I mean everything:
Once you’ve identified which web pages you need to review, take a screenshot of each page. Copy the screenshots into a word processing document, print the document, and then store the document in a folder for later use.
It seemed quite rigid and optimistic to me, given how much requirements can change during a development project. Maybe there's something about database design that requires this type of BUFD, but I can see why programmers can get frustrated with the database team if this is how they have to operate. I wonder if there's a way to make the design process more flexible and (dare I say) agile without sacrificing data integrity and consistency in the database design.
Despite these misgivings, Database Design for Mere Mortals was an excellent introduction to database design overall. Hernandez' approach was simple, straightforward, and methodical, and he did a great job of laying out a process that should be accessible to most people. In his own words, database design isn't hard:
I’ve always believed that you shouldn’t have to be a rocket scientist in order to design a database properly. It should be a relatively straightforward task that can be performed by anyone possessing a good amount of common sense. As long as you follow a good database design method, you should be able to design a sound and reliable database structure.
I always thought that to be true, from what little I knew about databases, and it's good to have the confirmation from an expert. I definitely recommend this book to anyone looking for a good way to design a database effectively. Maybe you can even find a way to make it more adaptable to change.
Seven Databases in Seven Weeks
I've been wanting to read one of these Seven X in Seven Weeks books for a while now, and this is the first one I was able to get to. It was excellent. I'm amazed by how much Eric Redmond and Jim Wilson were able to fit into 350 pages, and it's entirely coherent while being an immensely enjoyable read. They do actually cover seven databases in reasonable detail in this book, dedicating roughly 45 pages to each database.
The basic format of the book is one chapter per database, and each chapter is split into three sections representing three days of study. The first day covers the basic API and main features of the database. The second day goes through the more advanced features of the database that make it unique and interesting. The third day does something more involved using a large data set to populate the database and showing what kinds of structures and queries are best suited to that particular database. The end of each day includes some homework exercises to guide the reader in learning more about how each database works.
I thought this format worked extremely well, although the "seven weeks" in the title makes it seem like it would take longer to get through the book than it actually does. Since each chapter consists of three days, you could finish everything in three weeks if you did something every day. Each day's work is also short enough that you could conceivably squeeze seven databases into seven days instead of weeks, but you wouldn't be able to do anything other than eat, sleep, and study, so that's probably not realistic.
At any rate it's a great way to learn something significant about seven databases in a short amount of time. You'll learn their strengths and weaknesses and how to interface with them through code as well as the provided command line or HTTP interface. If you were looking for a new database to use on a project, this book would give you all the right information to figure out which database would work best as well as get you started on using it for your project.
I've come this far without even mentioning what the seven databases are, so I'll stop raving about the book and briefly describe each database that the authors cover.
PostgreSQL is a traditional relational database that has proven itself many times over on large software projects. As a relational database, the basic structure of the data it stores is in the form of tables that are made up of fields that describe characteristics of the data and rows that hold the values of those characteristics for specific entries in a table. Tables can be linked together with uniquely-identified primary keys, and queries can filter and sort the table data and use keys to find linked data in very flexible ways.
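Here is a minimal sketch of that table, key, and query structure. I'm using Python's built-in sqlite3 module as a stand-in so the example runs without a server, so this is not PostgreSQL itself, and the departments and employees tables and their data are hypothetical. The SQL shown is generic enough that I would expect it to look much the same in PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Each table has a primary key that uniquely identifies its rows.
    CREATE TABLE departments (department_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE employees (
        employee_id   INTEGER PRIMARY KEY,
        name          TEXT NOT NULL,
        salary        INTEGER,
        department_id INTEGER REFERENCES departments(department_id)  -- link to the other table
    );
    INSERT INTO departments VALUES (1, 'Engineering'), (2, 'Sales');
    INSERT INTO employees VALUES
        (1, 'Carol', 70000, 1),
        (2, 'Alice', 55000, 2),
        (3, 'Bob',   40000, 1);
""")

# Filter on a field, follow the key to the linked table, and sort the result.
rows = conn.execute("""
    SELECT e.name, d.name
    FROM employees AS e
    JOIN departments AS d ON d.department_id = e.department_id
    WHERE e.salary >= 50000
    ORDER BY e.name
""").fetchall()

print(rows)
```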
PostgreSQL has the advantages of being fast and having great driver support in most programming languages. It's a mature, reliable database with decades of development behind it, and being a relational database, most developers will be familiar with how it stores data and how to query it. As for weaknesses, it does suffer from scalability issues if you need to partition it onto multiple servers. Most relational databases don't handle partitioning nearly as easily as other types of databases. Also, if your data doesn't fit well into a rigid schema structure or mostly contains large blobs of data, PostgreSQL probably isn't the right database for your data.
Riak is a distributed key-value store, which means it's built from the ground up to run on multiple servers, and it stores and retrieves data using unique keys for each piece of data. The data can be anything from plain text to video clips, and the keys can be automatically generated or manually assigned. Data can be further divided into buckets to separate key-value pairs in a logical way. The interface to Riak is HTTP requests, and all the normal CRUD (Create, Read, Update, and Delete) operations are done through a slick REST interface using URLs so it's easy to interact with Riak through any programming language's HTTP library.
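To show the shape of the bucket and key model, here is a toy in-memory sketch. It is not Riak's API and it does no networking or distribution. The BucketStore class and its put, get, and delete methods are names I made up. In a REST interface, these operations would typically correspond to PUT or POST, GET, and DELETE requests on a URL that names the bucket and key.

```python
import uuid


class BucketStore:
    """Toy in-memory stand-in for a bucketed key-value store (not Riak's API)."""

    def __init__(self):
        self._buckets = {}  # bucket name -> {key -> value}

    def put(self, bucket, value, key=None):
        # Create or update; generate a key when the caller does not supply one.
        if key is None:
            key = uuid.uuid4().hex
        self._buckets.setdefault(bucket, {})[key] = value
        return key

    def get(self, bucket, key):
        # Read; returns None when the bucket or key is missing.
        return self._buckets.get(bucket, {}).get(key)

    def delete(self, bucket, key):
        # Delete; returns True if something was actually removed.
        contents = self._buckets.get(bucket, {})
        if key in contents:
            del contents[key]
            return True
        return False


store = BucketStore()
store.put("animals", {"name": "Polly", "species": "parrot"}, key="polly")  # manual key
auto_key = store.put("animals", b"values can be arbitrary bytes")          # generated key
```

The store never looks inside the values, which is why anything from plain text to binary data fits the same interface.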
Because Riak is distributed, it cannot be fully consistent, available, and partition tolerant at the same time. This type of trade-off pops up in all kinds of places. In project management you can't optimize fully for time, cost, and features. In programming you can't optimize code fully for performance, readability, and development time. You have to pick two of the three options. Riak steps around this problem by allowing you to pick two options for each request to the server. One write to the database could be consistent and available, and the next write could be available and partition tolerant instead. That's cool.
The authors' introduction to HBase is pretty good:
Apache HBase is made for big jobs, like a nail gun. You would never use HBase to catalog your corporate sales list, just like you’d never use a nail gun to build a dollhouse. If your data is not measured by many gigabytes, you probably need a smaller tool.
HBase is a column-oriented database similar to Google's BigTable database, and it's designed to be fault tolerant and highly scalable. The structure of the database is a set of tables with each table containing rows that are kept sorted by the row key. The rows are divided into column families, and each column family in a row can have its own set of columns. The same column family in two different rows can contain different columns. Every change to a value is stored as a new timestamped version rather than overwriting the old one, so the database provides revision control for free.
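The data model is easier to see as nested maps, so here is a toy sketch of it. This is not the HBase client API. The ColumnFamilyTable class, its methods, and the wiki data are all hypothetical, and I'm assuming versions are kept per value with a simple counter standing in for timestamps.

```python
from itertools import count


class ColumnFamilyTable:
    """Toy model of a column-family table (not the HBase client API)."""

    def __init__(self, families):
        self.families = set(families)  # column families are fixed up front
        self._rows = {}                # row key -> family -> column -> [(version, value)]
        self._clock = count(1)         # logical timestamps instead of wall-clock time

    def put(self, row_key, family, column, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        cells = (self._rows.setdefault(row_key, {})
                           .setdefault(family, {})
                           .setdefault(column, []))
        cells.append((next(self._clock), value))  # keep old versions, append the new one

    def get(self, row_key, family, column, versions=1):
        # Return up to `versions` values, newest first.
        cells = self._rows.get(row_key, {}).get(family, {}).get(column, [])
        return [value for _, value in reversed(cells)][:versions]

    def scan(self):
        # Row keys come back in sorted order.
        return sorted(self._rows)


wiki = ColumnFamilyTable(families=["text", "revision"])
wiki.put("Home", "text", "body", "Hello world")
wiki.put("Home", "revision", "author", "alice")
wiki.put("About", "revision", "comment", "first draft")  # same family, different column
wiki.put("Home", "text", "body", "Hello world, revised")

print(wiki.get("Home", "text", "body", versions=2))
print(wiki.scan())
```

Only the column families are declared ahead of time. The columns inside a family are created on write, which is why two rows can carry different columns in the same family.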
Being a highly scalable database, HBase's major weakness is that it's not for small data sets. Like the quote says, if you're not dealing in huge amounts of data, HBase is going to be overkill.
MongoDB is a document database. Instead of storing data in tables, columns, and rows or key-value pairs, it stores data in documents. A document is a JSON-like object that contains an _id field and any number of other fields, each with a name and a corresponding value. That value can be a string, a number, an array, or even another nested object with the same structure as the base document. There is no defined schema, so each document can be completely different from the others, although in practice such an ad hoc database would be unusable. The application is responsible for maintaining the data structure that it requires.
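Here is a small sketch of what such documents look like, using plain Python dicts and the standard json module rather than a MongoDB driver. The towns data, the REQUIRED_FIELDS set, and the missing_fields helper are all hypothetical. The helper shows one way an application might enforce the structure it depends on, since the database won't.

```python
import json

# Two documents in the same hypothetical collection with different fields (made-up data).
towns = [
    {
        "_id": 1,
        "name": "Springfield",
        "population": 30000,
        "famous_for": ["donuts", "monorail"],      # an array value
        "mayor": {"name": "Sam", "party": "I"},    # a nested document
    },
    {"_id": 2, "name": "Shelbyville", "famous_for": ["lemons"]},  # no population or mayor
]

# The application, not the database, decides which fields must be present.
REQUIRED_FIELDS = {"_id", "name"}


def missing_fields(document):
    """Return the required field names that the document lacks."""
    return REQUIRED_FIELDS - set(document)


serialized = json.dumps(towns[0])  # the document round-trips as JSON text
print(serialized)
print([missing_fields(town) for town in towns])
```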
MongoDB is another scalable database designed for storing massive amounts of data and running on large clusters of machines. It has more flexible querying capabilities than Riak or HBase, but you can easily shoot yourself in the foot with its lack of schemas. The application's data model has to be well defined, or else the database can quickly become a humongous mess.
CouchDB is a document database like MongoDB, and it has a REST interface like Riak. It was designed to be simple and flexible, and it can run on almost anything—from large data centers to your smart phone. It never deletes data stored to it. When you update a document, it creates a new version of it so you get revision control for free, and this mechanism is also the way CouchDB ensures data consistency in a distributed system.
Neo4j is an entirely different type of database, as the authors describe:
Neo4j is a new type of NoSQL datastore called a graph database. As the name implies, it stores data as a graph (in the mathematical sense). It’s known for being “whiteboard friendly,” meaning if you can draw a design as boxes and lines on a whiteboard, you can store it in Neo4j. Neo4j focuses more on the relationships between values than on the commonalities among sets of values (such as collections of documents or tables of rows). In this way, it can store highly variable data in a natural and straightforward way.
It would be perfect for storing a family tree or, as explored in the book, actor connections in movies so that you could query for actors six degrees from Kevin Bacon. It's an interesting departure from most other databases, and it's great for researching certain kinds of data problems.
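The "degrees of separation" query is a shortest-path search over a graph. Here is a rough sketch of the idea in plain Python, using a breadth-first search. It does not use Neo4j or its query language, and the cast lists and the degrees_of_separation function are made up for the illustration.

```python
from collections import deque

# Hypothetical cast lists: movie -> actors who appeared in it.
casts = {
    "Movie A": ["Alice", "Bob"],
    "Movie B": ["Bob", "Carol"],
    "Movie C": ["Carol", "Dave"],
    "Movie D": ["Erin"],
}

# Build an undirected graph: two actors are linked if they share a movie.
graph = {}
for actors in casts.values():
    for actor in actors:
        graph.setdefault(actor, set()).update(a for a in actors if a != actor)


def degrees_of_separation(graph, start, goal):
    """Breadth-first search; returns the number of hops, or None if unreachable."""
    if start not in graph or goal not in graph:
        return None
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        actor, hops = queue.popleft()
        if actor == goal:
            return hops
        for neighbor in graph[actor] - seen:  # only visit actors not reached yet
            seen.add(neighbor)
            queue.append((neighbor, hops + 1))
    return None


print(degrees_of_separation(graph, "Alice", "Dave"))
print(degrees_of_separation(graph, "Alice", "Erin"))
```

A graph database stores these links directly, so you express this kind of traversal as a query instead of writing the search yourself.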
Redis is another key-value store like Riak, but its defining characteristic is its speed. Redis is fast, and if it's configured to be a purely in-memory store, it's blazing fast. It's more like a cache than a database when set up this way, and normally it is backed by one or more other databases. Redis is used to speed up access to other databases more so than storing data itself. It also has a very fully-featured command line interface with over 120 commands that make it easy to use and integrate into a larger system.
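The cache-in-front-of-a-database pattern can be sketched in a few lines. This is not the Redis API. A plain dict stands in for the in-memory store, another dict stands in for the slower backing database, and the CachedReader class and its methods are hypothetical names.

```python
class CachedReader:
    """Cache-aside sketch: a dict stands in for the fast in-memory store."""

    def __init__(self, slow_lookup):
        self._slow_lookup = slow_lookup  # e.g. a query against the backing database
        self._cache = {}
        self.misses = 0                  # how often we had to hit the backing store

    def get(self, key):
        if key in self._cache:           # cache hit: skip the backing store entirely
            return self._cache[key]
        self.misses += 1
        value = self._slow_lookup(key)   # cache miss: fall through, then remember the result
        self._cache[key] = value
        return value

    def invalidate(self, key):
        # Call after the backing store changes so stale data is not served.
        self._cache.pop(key, None)


backing_db = {"user:1": "Ada", "user:2": "Bob"}  # stand-in for the slower database
reader = CachedReader(backing_db.get)

first = reader.get("user:1")   # miss: goes to the backing store
second = reader.get("user:1")  # hit: served from the cache
print(first, second, reader.misses)
```

The hard part in practice is invalidation. If the backing data changes and nothing calls invalidate, the cache keeps serving the old value.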
These summaries give you an inkling of what's covered in Seven Databases in Seven Weeks, but I can't begin to do any of these databases justice in this short space. The authors even claim that they only scratch the surface, although I would say they do a phenomenal job of it. It was quite an entertaining read, and I finished it with a much better understanding of the wide variety of databases out there and what their strengths and weaknesses are. The appendix even includes a great comparison table of the different features and capabilities of the seven databases, a great reference to come back to later. I highly recommend this book for anyone who needs to find a new database to solve a gnarly problem or who is curious about the different kinds of databases available beyond the ubiquitous relational database.
A Crash Course in Databases
I set out to learn some useful things about databases, and with these two books, I definitely accomplished that. Database Design for Mere Mortals gave me a good understanding of the issues involved in designing for a relational database and how to solve common database problems. The design process should be fairly easily adaptable to other types of databases as well, and it gives a clear picture of where things can go wrong that will come in handy when you're designing for a schema-less database. Seven Databases in Seven Weeks exposed me to a wide variety of databases in a great hands-on way that was incredibly engaging. I'm looking forward to reading more of the Seven X in Seven Weeks series, and I hope they're all as good as this one.
Between the two books, I learned a ton of stuff about databases, but not everything, of course. There wasn't much coverage of relational algebra or the implementation of some of the fundamental features of databases, like indexing or storage management. I'll have to find a different book for those topics, but these two books were excellent in what they covered. They complement each other well with almost no overlap in material, and together they're a great crash course in databases.