With so many options and a consistently growing number of new technologies becoming available every year, the world of databases and data stores can be confusing or hard to keep up with. And with Oracle’s recent 18c release being mainly based around their autonomous database which may revolutionize the DBA domain all together, we thought this was a good time to revisit what we know about Databases.
Different database types are best suited for different applications or workloads.
Choosing the wrong type of data store may result in not being able to handle or process the data inputs, and can have dire consequences in terms of performance as workloads grow to scale. Not being careful about which database solution to run with can even drive up costs to the point where it is no longer commercially viable to maintain the database.
Therefore this series of blogs, split into three, will serve as a guide on everything we know about database systems, analyze which database solution to implement for each use case, compare some of the most popular DB Solutions for each use case, and specifically highlight some things to consider when running various DB solutions on different clouds. This blog is less technical than some of our others, and serves to provide a high-level analysis aimed at helping IT and Operations Managers make more informed decisions and avoid costly mistakes down the road as business takes off.
Part 1 will cover some of the true database basics including different types of DBs and differences between them, while Part 2 will take a closer look at the different use cases for each type (these two first parts are for those new to DBs or wishing to refresh their memory). Part 3 will analyze and evaluate some of the most common and popular database management systems for each use case and draw technical comparisons between them.
The first section of this blog covers database fundamentals which we acknowledge, have already been covered in a variety of other sources at some length. The reason we revisited these topics here, is because we wanted to create one stand-alone repository that answers all the questions we receive, and with consideration to the increasing complexity of some of the newer database management systems, we thought it would be good to review the basics.
If you know your way around databases well and want to get straight to the practical nitty-gritty, you may want to skip Part 1 and 2, and proceed straight to Part 3.
1. Different Databases Categories: RDBMS vs. OODBMS vs. ORDBMS
Lets start by reviewing some of the basic fundamentals of data storage and processing.
1.1 Relational Database Management Systems (RDBMS)
Most of you will be familiar with the classic form of storing data in relational tables made of rows and columns, where stored pieces of information relate to each other via a given set of attributes or dimensions. These forms of storing data are known as Relational Databases and came into existence in the 1970s.
When creating a relational database, you can define the domain of possible values in a data column and further constraints that may apply to that data value. For example, a domain of possible customers could allow up to ten possible customer names but be constrained in one table to allowing only three of these customer names to be specifiable. The definition of a relational database results in a table of metadata or formal descriptions of the tables, columns, domains, and constraints.
Relational Databases use Structured Query Language (SQL) as their standard language. SQL statements are used to prompt interactive queries or search for specific information within a database and to gather data for reports. RDBMs became popular because they were fast and provided users a powerful way to slice and dice data in any dimension they wanted. Furthermore they were relatively easy to extend, which meant that new categories or dimensions to collect information could be added without requiring that all existing pieces of data be modified. But most importantly relational databases offered transactional consistency.
Software systems used to maintain relational databases are called Relational Database Management Systems (RDBMS), and historically, some of the most popular examples included Oracle DB, MySQL, Microsoft SQL Server, PostgreSQL and MemSQL, although MemSQL is a bit different as the only in-memory DB listed here.
1.2 Object-Oriented Database Management Systems (OODBMs)
An object database is a database management system in which information is represented in the form of objects as used in object-oriented programming. Objects are self-contained components which to contain properties and methods to make certain types of data useful. Objects are more complex than the contents of RDBMS which are relatively simple, and thus object databases are not table-oriented.
In OODBMS, objects are organized into superclasses and subclasses, which can be understood in terms of the is a relationship. A subclass is a more specific instance of a superclass. For example, an orange is a citrus fruit, which is a fruit. A shepherd is a dog, which is an animal. Each superclass is not limited to one, but can have multiple subclasses, such as a Pen class can have Fountain Pen and Ballpoint Pen subclasses to describe different kinds of pen.
Object databases gained popularity in the early 1980s due to growing need to store and process large amounts of more complex data types in an attempt to more accurately model the real world. In strict OO view, the nature of all data is object-oriented, as everything in the world can be modeled as an object. In other words, everything is part of something else which is part of something else. OODBMS allowed for encapsulation, inheritance, abstraction, and polymorphism, and better interfaced with OO languages like C++ or Java.
As a quick recap; Encapsulation describes the inclusion of all the properties, methods, and resources needed for the object to function within that object. Typically, only the object’s own methods can directly inspect or manipulate its fields. Encapsulation means that the internal representation of an object is generally hidden from view outside of the object’s definition. Therefore encapsulation can be used as a language mechanism for restricting direct access to some of the object’s components (also known as information hiding).
Inheritance enables new classes to receive properties or methods of existing classes.
Data abstraction is the reduction of a particular body of data to a simplified representation of the whole. Through the process of abstraction, programmers can hide all but the most relevant data about an object in order to reduce complexity and increase efficiency. Finally, polymorphism refers to a language’s ability to process objects differently depending on their data type or class. This means, no matter what shape an object is, applying the area method to it will return the correct results.
OODBMS contain objects with many-to-many relationship, which are accessed by the use of pointers rather than SQL, which is why OODBMS are a form of Not Only SQL or NoSQL databases. Pointers are linked to objects to establish relationships, and as such, OODBMS can be programmed with small procedural differences without affecting the entire system.
Some of the most popular Object Oriented DBMS include Caché, Db4o, Versant Object Database, ObjectStore, Matisse, and GemStone/S, although their use is by far not as widespread as Relational or Object Relational DBMS. This is most probably due to the fact that a lot of database players don’t support it, or atleast don’t support OODBMS exclusively.
1.3 Object-Relational Database Management Systems (ORDBMS)
Object-relational databases are the direct result of SQL evolution and represent a hybrid of both previous approaches used to store semi-structured data. Object-Relational Databases can be described as object-oriented formats on top of relational technology, in a relational context. ORDBs basically store data in tables of objects rather than in tables of rows, which grants these forms of data stores the advantages of both worlds.
The benefits of object-relational databases include being able to handle complex data types, encapsulation, and inheritance, but also aggregation and complex object extensibility. ORDBMSs allow developers to embed new classes of data objects into the relational data model abstraction (and on top of SQL), but the mixture of both technologies can also result in difficult to understand schemas and result in performance problems.
Some examples of the most commonly used ORDBMS include Oracle Database, PostgreSQL, Microsoft SQL Server, or Informix. More recently, it has become common to refer to ORDBMS as RDBMS, but it is important to understand the difference.
The diagram below helps to conceptually position and map the different database types to their respective domains, types of data stored, and retrieval or access methods.
1.4 Not Only SQL (NoSQL) Database Management Systems
Not Only SQL (NoSQL) or non-relational databases have been around since the early 1960 but have only recently become popular due to growing data complexity and increased data processing requirements for large data sets from the likes of Google, Facebook and other digital giants associated with Web 2.0. NoSQL databases store data not in tables made of rows and columns but rather include data modeled in key-value, wide column, graph, or document stores.
They are sometimes referred to as Not Only SQL to emphasize that they may support lower level SQL-like query languages. NoSQL databases are often used in big data and web application scenarios, because of their flexible data structures, simplicity in design, simpler “horizontal” scaling to clusters of machines and finer control over availability. They generally favour availability, partition tolerance and speed over transactional consistency, so a NoSQL databases’ strength lies in its proper application to suitable use cases.
Common examples of NoSQL databases include MongoDB, Cassandra, DynamoDB, Couchbase and HBase, along with many more being brought to market every year due to their current popularity.
2. Major Differences between Relational and NoSQL Databases
In short, the differences between relational and non-relational databases can be categorized into four main dimensions.
Data models: A NoSQL database lets you build an application without having to define the schema first, unlike relational databases which make you define your schema before you can add any data to the system. Not having a predefined schema makes NoSQL databases much easier to update, as your data and requirements change. This means NoSQL makes agile development easier, as less planning is needed, but it also increases the complexity and thus raises significant new challenges for analysis.
Data structure: Relational databases were built in an era where data was fairly structured and clearly defined by their relationships. NoSQL databases are designed to handle unstructured data (such as texts, social media posts, videos, or emails), which make up much of the data that exists today.
Scaling: It’s much cheaper to scale a NoSQL database than a relational database because you can add capacity by scaling out via inexpensive commodity servers. Relational databases, on the other hand, require a single server to host your entire database. To scale, you need to buy a bigger, more expensive server.
Development model: Most NoSQL databases are open source whereas relational databases typically are closed source with licensing fees baked into the use of their software. With NoSQL, you can get started on a project without any heavy investments in software fees upfront.
3. Differences within Non-Relational DBs (not all NoSQL is NoSQL)
While these general distinctions can be made between relational (RDBMS) and non-relational DBMS, there are significant differences between types of Not Only SQL Databases (NoSQL), you should consider when formulating a data store and management strategy. The most common types of not strictly SQL DBMS include Key-Value stores, Document stores, Columnar databases, and Graph Databases, each of which we will briefly examine in more detail.
3.1. Key-value Stores (ie. Redis, DynamoDB, Cosmos DB)
Key-value stores are probably the simplest form of database management systems. They can only store pairs of keys and values, as well as retrieve values when a key is known.
In a key-value store, there is no schema and the value of the data is opaque. Values are identified and accessed via a key, and stored values can be numbers, strings, counters, JSON, XML, HTML, binaries, images, short videos, and more. It is the most flexible NoSQL model because the application has complete control over what is stored in the value.
These simple systems are normally not adequate for complex applications. On the other hand, it is exactly this simplicity that makes these systems attractive in certain circumstances. For example resource-efficient key-value stores are often applied in embedded systems or as high performance in-process databases.
An extended form of key-value stores is able to sort the keys, and thus enables range queries as well as an ordered processing of keys. Many systems provide further extensions so that we see a fairly seamless transition to document stores and wide column stores.
Important examples of Key Value stores include Redis, Amazon DynamoDB, Memcached
Microsoft Azure Cosmos DB, and Hazelcast.
3.2. Document Stores (ie. Mongo DB, CouchBase)
Document stores, also called document-oriented database systems, are characterized by their schema-free organization of data. This means that records do not need to have a uniform structure (different records may have different fields or entries), and the types of the values of individual columns can be different for each record. In document stores, columns can have more than one value (arrays) and records often have a nested structure.
Whilst, objects have methods, predefined schema, and inheritance hierarchies, document stores do not include this type of code. Although some relationships between documents may exist, document stores do not persist entire “graphs” of objects and pointers between documents are deemphasized.
Relationships are deemphasized because the goal of document stores is to maintain the client-server paradigm and keep code biased to the client. Arbitrary graphs would make it difficult to process graphs from a client without many client-server turnarounds – making it significantly slower and prone to error. Second, with the goal of maximizing horizontal scalability, graphs of objects would make it difficult to partition amongst servers in a manner that guarantees consistent high performance. Consequently document stores do not emphasis or stress relationships between documents.
Document stores often use internal notations, which can be processed directly in applications, mostly JSON. JSON documents of course can also be stored as pure text in key-value stores or relational database systems. That would, however, require client-side processing of the structures, which has the disadvantage that the features offered by document stores (such as secondary indexes) are not available.
Although inherently schema-free, significant value can be derived from creating a schema in document stores for better data management, identification, and extraction. The defined schema may vary from the internal “code schema” of the application, but there may still be an explicit declaration of indexes. For example MongoDB, one of the currently most popular document stores, uses collections (analogous to a table) of JSON documents, and implements an explicit declaration of indexes on specific fields for the collection. This not only allows for better and faster performance, but the decoupling of data and code provides flexibility, as code is prone to change often.
Other important examples of document stores include Couchbase, CouchDB, Amazon DynamoDB and Microsoft Azure CosmosDB, although the latter two are multi-model and not purely document stores.
3.3. Wide-Column Stores (ie. Cassandra, HBase)
Wide column stores, are also called columnar stores, column family databases, or extensible record stores, and store data in records which have the ability to hold very large numbers of dynamic columns. Since the column names as well as the record keys are not fixed, and since a record can have billions of columns, wide column stores can be also seen as two-dimensional key-value stores.
Wide column stores must not be confused with the column oriented storage model of some RDBMS. Simply put, wide-column stores like Cassandra or HBase store records in column by column format, not in rows of records across tables of columns like in RDBMS. The important thing to understand about column oriented databases is that the data is stored in keyspaces and column families within them. Take a look at the two images below to get a better idea of what we mean:
In our example, the image on the left shows a keyspace containing multiple column families. The image on the right shows a column family, a User Profile for example, containing 3 rows or records, with each record containing its own set of multiple columns.
Both columnar and row-oriented RDBMS can use traditional database query languages like SQL to load data and perform queries. However, storing records in columns enables the database to quickly locate and more precisely collect subsets of data from one record, resulting in significantly improved query performance, particularly across very large datasets.
Columnar databases exhibit better query performance when only a limited subset of data is required per record, as for example in a rolodex application where the first and last names of many rows is collected to build a list of contacts. For deeper operations where all the data for a given record must be retrieved, multiple disk reads would be required in a columnar store, and thus a row-oriented database will be faster.
Some of the most popular examples of columnar stores include Cassandra, HBase, and Azure Cosmos DB.
3.4. Graph Databases (ie. Neo4j, OrientDB)
A graph or graph-oriented database, is a NoSQL database which uses graph theory to store, map and query relationships. Essentially these databases are a collection of nodes and edges, where each node represents an entity (such as a person or business) and each edge represents a connection or relationship between two nodes.
Every node in a graph database is defined by a unique identifier, a set of outgoing and a set of incoming edges, as well as a set of properties expressed as key-value pairs. Each edge also has an identifier, a starting-place and/or ending-place node, and a set of properties.
These modern multi-model databases can provide more functionality and flexibility that other databases and in some cases even be powerful enough to replace traditional DBMSs. However, so far graph databases have not enjoyed as much widespread use as the other 3 NoSQL types yet, although they are rapidly becoming more and more popular.
Some examples of current graph databases include Neo4j, OrientDB, DataStax, and ArangoDB.
The next part of this blog series will consider the relevant use cases for each of the database types and will discuss some real-life application examples.To be continued…