1.4 Databases

Databases are an important part of informational technology since the beginning of the sector. Databases are organized collections of data and can either be saved in a file system or hosted on computer clusters or cloud storage. In this chapter we are going to familiarize ourselves with different database paradigms, so we can choose the right paradigm for the job. If you want to learn more about databases and transactions, you can read about it in Kleppmann, Martin: Designing Data-Intensive Applications(2017), 1. edition, Chapter 7: Transactions

Types of databases

In this video we have learned about 7 different database paradigms. We have summarized them in the table below.

Paradigm	Description	Example	Pros/Cons	Best for
Key-Value	One Key is assigned one value, almost like an object	redis, memcached, etcd	stores in machines memory, which makes it fast but limited in space	caching, pub/sub, leaderboards
Wide Column	Similar to Key-Value, but with a second dimension of ordered rows	Cassandra, HBase	can handle unstructured data,scales horizontally	time-series, historical records, high write, low read
Document	Documents, which are containers for key-value pairs, which then can be structured and organised	mongoDB, dynamoDB, couchDB	can handle unstructured data, do not support joins, schema-less	mobile games, IoT, content management
Relational	IDs from different subtables are referenced in higher hierarchy tables to organize data	MySQL, PostgreSQL	works with SQL to access and write data into the base, ACID compliant, requires a schema, difficult to scale	popular for most applications
Graph	Data is represented as nodes and edges in between the nodes	Neo4J, DGraph	high performance on large datasets	Graphs, Knowledge Graphs, recommendation engines
Full Text Search	Similar to document database, but the search database will analyze the text in the document to create an Index	Lucene, Solr, Algolia	can handle typos, can filter results, expensive at scale, very helpful	search engines
Multimodel	combines multiple paradigms using GraphQL, after providing parameters it will figure out itself which paradigms it uses	FaunaDB	high flexibility, low overhead	everything?

We hope you could take some new information with you and next time when you need to decide on a database, you can make an informed decision. One example of making a decision on what database to use can be read in the UMH blog article: Why we chose TimescaleDB over InfluxDB.

ACID

ACID is an acronym for a set of properties, which database transactions should fulfill to ensure data validity, even if error or even power failures occur. This is required for many business critical applications, where loss of data is unacceptable. For example data validity needs to be absolutely guaranteed when dealing with banking transactions. ACID properties can be somewhat lifted, when transactions are not all that important, while increasing performance. One example would be when logging site metrics, where losing a minuscule amount of metric logs is not all that important. As ACID is not a formal requirement, but more of a way to remember, which properties you need to watch out for when designing how your transactions work.

Andreas Reuter and Theo Härder named the ACID properties in 1983. Many SQl Databases ensure data validity through ACID.

In the following video, you can educate yourself about each of the four ACID properties. Or you can read about it below.

A - Atomicity

Transactions frequently are compromised of multiple different single changes. As these are done sequentially, they may invalidate data if the transaction fails after some, but not all, changes are committed when the transaction is disrupted. For example, if we are using a bank transaction we could have at least 2 database changes: the removal of currency from one account and the addition of currency to another entry. If the transaction is aborted after the removal, but before the addition, the transaction functionally deleted money from the database, which is an issue and invalidates the account balance on either account.

Now, how can we prevent that? By ensuring atomicity. Atomicity means that transactions cannot be divided, they either go through fully or not at all. One way to guarantee that is by taking a snapshot copy of the relevant database entries before the transaction starts. If the transaction goes through properly, no additional action needs to be done. But if the transaction fails. the entries in the snapshot can be reintroduced to make it seem like the transaction never started. Because the transactions operate on a snapshot of the database entries, this also adds another bonus: transactions cannot influence each other.

C - Consistency

Consistency means that at the start and the end of each transaction the database is in a state, that is valid to the restrictions that are set. If we continue with the example, then the database would most likely have a constraint that the sum of currencies of all affected accounts is the same before and after. These constraints need to be upheld for the database to be valid.

I - Isolation

This property is something we already mentioned a little when discussing atomicity. Isolation means that transactions cannot influence each other or see their current state. This means when a new transaction starts it either sees data states as if transactions regarding it have not begun or if they are completed entirely. Low isolation is sometimes used because it allows for more concurrent transactions. However, databases with low isolation show some issues. For example phenomena such as dirty reads can occur. This means for example if one transaction wants to read out data and another transaction changes, but not commits the data, transaction 1 can read the new data, even if the transaction 2 could still be rolled back. If the second transaction is rolled back, then transaction 1 will have read data that was never committed and therefore never valid. To ensure high isolation, while also allowing for as much concurrent transaction without compromising said isolation, concurrency control algorithms are in place to check which transaction can be allowed to work concurrently and which on do not.

D - Durability

Databases all want the D… urability. Durability in this context means, that if a transaction is committed, it will permanently change the database. If I use a transaction to book a hotel room, the booking needs to be in the database, even in a case of another faulty transaction or even a power outage. One way to ensure that is keeping a transaction log, which can be utilized in case the database is damaged.

Open Source Databases

Open Source databases like MySQL, PostgreSQL or MongoDB are becoming more and more popular. As many Open Source applications we have already covered, they are all defined by their openly available source code, which can be modified as one wants. This has a couple advantages and disadvantages compared to proprietary databases, like Microsoft SQL Server or IBM DB2.

We have covered something similar in one of our most popular blog posts. Please note, that the comparison in the blog article is in particular to historians, which are specific proprietary time-series databases specifically designed for OT.

Historians vs Open-Source databases - which is better?

Deciding on a database

Deciding on a database can be a daunting task, especially when you need to migrate from one database to another. In the following video, there are some pointers on what important steps you should take and what decisions you need to make.

Let’s review what this video covered:

Before switching to a new database, make sure it actually is necessary, and you have exhausted all possibilities with your current database. If you are sure you really need to decide on a new database, you can actually start with the vetting process. Start by deciding on your first candidates, based on their advantages. Thoroughly read their manuals and documentations, as well as things like GitHub Issues to get a grasp on the limitations on disadvantages of the respective databases. If there are any databases with limitations that are incompatible to your specifications you can exclude them as a candidate. Before finally deciding on your database, you should benchmark test all candidates to see which one performs best. The P99 value (the 1% slowest request processing time of the database during the benchmark) is a good indicator for that performance.

Excursion: how to store passwords securely in databases

If you ever come in the situation, where you want to create an application with user authentication, you might need to save user passwords. That we can’t do that in plain text is fairly obvious, but how would one go about to do that? The following Video explains this fairly simply and clearly. The main takeaways should be what hashing and salting is and how they allow for secure storage of a password. Please be aware that this is merely one of many ways to securely store passwords.

1.4 Databases

Types of databases