Every software system will have some aspect of data storage and retrieval. Data is invaluable and despite all the data protection requirements everywhere, there will be some element of data that has to be stored for a certain period or time and could be forgotten beyond that, while other types of data might have to be retrieved frequently and would be expected to be fetched so fast in order to minimise latency.
Why not store all the data in a file or a collection of files and just read from there instead? Databases do that for you, but enable you to have a sensible model of the data that lets you query it with some meaning. Imagine searching through a text file for data or think about how you’d make meaningful decisions based on data in a bunch of files! Databases allow you to view that raw data differently.
Databases vs Files - similarities and differences
Databases offer a variety of features that you just cannot get by putting plain text data in files. These are in short some of the advantages of a database.
- Ability to deal with extremely large volumes of data, like petabytes!
- Ability to concurrently read the same piece of information by several users. Ensure that the data is updated and stored consistently. Without data consistency, you wouldn’t trust data from your database.
- Security features like the ability to assign role based access privileges to users of the database.
- Ability to search for specific dataset based on a criteria or condition quickly.
- Ability to scale based on the vast amount of data that has to be stored. You could partition across nodes to improve parallel data processing and there by improving performance.
- Availability - databases can be configured to be backed-up and mirrored to ensure that in case something happens to one of the instances , the data is still available for consumption
What is a database?
A database is an organised collection of structured information or raw data stored electronically in a computer. They are designed to make storage, retrieval and modification, including deletion of related data easy for the purpose of data processing. You cannot escape a database system. Your banks, grocery shops, trendy clothes shops, all of it, use databases behind the scenes. Fascinating stuff, if you ask me.
What are the different types of databases?
There are many different types of databases. However, you could broadly classify into SQL databases and NoSql databases. There are many more too.
SQL databases are databases that you interact with using Structured Query Language.
Info
SQL is a standard database language which is used to create, maintain and retrieve information from a relational database.
There are several more specific types of databases like object-oriented databases and graph databases. Then there are databases that are great for extremely fast querying and analysis called Data warehouses! More recently vendors have been coming up something called a multi-model database - they combine multiple different database models - relational, object-oriented, non-relational and unstructured all into one unified back-end and provides an API that can be used to read, modify and update these databases.
Relational
In this type of database, data is stored and retrieved as relational entities. It all comes down to data modelling. In this type of database, data is modelled and stored as entities that are related to one another.
Unstructured data
NoSQL databases are best suited for storing data of unstructured type. Like you would imagine storing random text files together grouped by a folder or something logical. But the data isn’t structured enough to create relationships between them.
Both relational and non-relational databases play in important role in a software system. You might need one or the other for your use case. Sometimes you might just have to store data in different forms in different types of databases based on what you intend to use the data for.
Thus one type of database is not superior to the other. They complement one another and are good at certain things that the other is not good at.
Relational vs non-relational - a deep dive
So far I have given a brief description of what these things are. Let us now take a deeper look at what they really mean.
Relational databases
These databases became dominant in the industry in the 1980s. They grew popular because of their simplicity, flexibility in managing generic information, performance and scalability. It is easier to reason with entities that have relations and are similar to real world entities.
The data in this type of database is stored as entities with relations to other entities as described earlier. Each entity is stored is modelled as a table where data about that entity is stored. And these tables are related to other tables. Thus when it comes to fetching related data, all the related data is joined and fetched as one.
Every entity would have a unique identifier that sets it apart from the rest which enables linking of that entity’s data with related information in other tables.
A record in a table is an entity, the unique identifier is called the primary key and the key used to define a relationship with another entity is called a foreign key.
As touched briefly earlier, SQL is the interface used to do any operations to this type of database.
ACID properties
A database transaction is a single unit of work that involves multiple tables to be updated in the database. If a database transaction is successful, then all the related entities are updated. If it failed, the database has to ensure that the failure still resulted in consistent data in the database. Not all databases take care of this bit - or adhere to this.
ACID properties is a term defined in database theory that describes certain rules and procedures for handling transactions. If a database adheres to this, then the data in the system is said to be reliable and consistent.
ACID is an acronym that expands to Atomicity, Consistency, Isolation and Durability.
- Atomicity - Transaction is either executed completely or not executed at all. At no point in time will a transaction fail with a partial update.
- Consistency - A successful transaction moves a database from one consistent state to another
- Isolation - A database is not updated by just one transaction at a time. There maybe multiple concurrent transactions happening at the same time. Thus it is important to ensure that the final state must be the same as if the transactions occurred separately. This one is easier to explain with an example rather than just a line here.
Info
Isolation Example
Consider two values stored in the database: a = 10 and b = 20. Introducing T and T’ as two different but concurrent transactions.
T does the following:
- read a
- divide a by 2 to create new a; a = 5
- write a
- read b
- multiply a and b to create new b; b = 100
- write b; 100
At the same time T’ does the following:
- Read a; a = ? 10 or 5?
- read b; b = ? 20 or 100?
- calculate sum of a and b to c;
- write c
The values that T’ works with would depend on whether T got completed before T’ or not. If both T and T’ started at the same time, then both might read a as 10. But by the time T gets to multiple a and b, b could have been 100 or 20.
Isolation ensures that both T and T’ sees the same values of a and b to begin with and only sees the new value after one of the transactions has completed successfully.
Some databases allow you to configure isolation levels in such a way that you can read transient values from other transactions which might be something your application needs based on what it does.
- Durability - The data within the database must only change as a result of a transaction not because the database system was updated.
Advantages of Relational Databases
When it comes to structure data, relational databases are the default choice by software professionals. We already took a look at the ACID properties and how it is something that relational databases do.
The database schema is flexible and can be changed anytime using Data definition language. This allows modifications of tables, columns, datatypes, etc.
By storing data as entities that are related to one another and linked using unique identifiers reduces the need to duplicate data in multiple places, thereby reducing the need to update the same information multiple times.
The way relational databases handle transactions allows concurrency to be handled without conflicting. This also makes it possible to roll a change back to a certain point or to commit after a set of operations.
With data consistency, it is easy to import and export data to facilitate backup and restore. Most relational databases can be configured to do continuous mirroring to avoid loss of data.
Disadvantages
One of the biggest downsides of using a relational database is Impedance mismatch.
Impedance mismatch
As your applications get complex you abstract the layer that talks to the relational database that does all the fetching and updates. This layer often called an Object Relational Mapper is an important part of modern web application development.
The data represented in memory are as objects with properties, however we all know that data in the database is represented as tables connected through unique identifiers called foreign keys.
Not all programming language data-types accurately represent the data-types supported by the underlying database.
Thus the ORM has to do a lot of weight-lifting to ensure that the objects in memory accurately reflects what was in the database and similarly when a transaction is completed, the object is transferred accurately to the corresponding tables in the database.
In order to reduce this, sometimes people use stored procedures or specialised database programming languages to do such operations.
Summary
That was a lot of information about databases and particularly relational databases. Let us explore NoSQL databases and some other related concepts next.