Have you ever wanted to learn more about databases but did not know where to start? This book is for you.

We can treat databases and other infrastructure components as black boxes, but it doesn’t have to be that way. Sometimes we have to take a closer look at what’s going on because of performance issues. Sometimes databases misbehave, and we need to find out what exactly is going on. Some of us want to work in infrastructure and develop databases. This book’s main intention is to introduce you to the cornerstone concepts and help you understand how databases work.

The book consists of two parts, Storage Engines and Distributed Systems, since that’s where most of the differences between databases come from.

In Storage Engines, we start with taxonomy and terminology, then explore In-Place Update storage and discuss several B-Tree variants and their structure. Then we talk about binary data formats and file organization and explore ways to compose efficient on-disk structures. After that, we go into detail on the techniques different databases use when implementing B-Trees and talk about related structures and processes such as the Page Buffer, the Write-Ahead Log, compression, defragmentation, and compaction. Finally, we discuss Log-Structured storage and explore several storage engine approaches, such as Bw-Trees, FD-Trees, CoW B-Trees, Bitcask, WiscKey, two- and three-component LSM Trees, and a few others.

In Distributed Systems, we start with basic concepts such as processes and links and build up to more complex communication patterns. We quickly discover that communication is unreliable and discuss which guarantees we have and how to achieve them. We cover important concepts such as Failure Detection, Leader Election, and Gossip Dissemination. After that, we explore different Consistency Models and talk about ways to achieve them. After covering Atomic Commitment and Broadcast, we move to the pinnacle of Distributed Systems research: Consensus Algorithms.

This book includes references to 100+ papers, 10+ books, several open source database implementations, and other sources you can refer to for further study.

What’s Inside

Taxonomy and Terminology

We discuss the precise definitions, use cases, applications, and differences between the existing classes of databases and storage engines: Column- vs Row-Oriented Stores, Memory- and Disk-Based databases, In-Place Update and Immutable storage engines.

In-Place Update Storage Engines

Many modern databases, such as PostgreSQL and MySQL, implement variants of the mutable in-place update data structure: the B-Tree. We’ll discuss its origins, binary on-disk layout, organisation, and popular variants such as Blink-Trees, B*-Trees, Copy-On-Write B-Trees, and others.

Auxiliary Structures

Storage engines consist of a primary storage data structure and several auxiliary subsystems that take care of garbage collection, maintenance, and compression. Many modern databases use a Write-Ahead Log for restore and recovery and implement buffer management in the form of a Page Cache.

Log-Structured Storage

With the advent of SSDs, we’ve seen many databases implementing and using Log-Structured storage. We’ll explore the whole spectrum of immutable data structures, ranging from B-Tree-like LSM Trees and Bw-Trees to unsorted variants such as LLAMA, Bitcask, and WiscKey.

Problems with Distributed Systems

How are distributed systems different from single-node ones? What are the FLP Impossibility and Two Generals problems? How does communicating over a network through message passing limit what we can and cannot do, and how can we build reliable systems despite these complications?

Consistency Models

In replicated systems, where we have multiple copies of data, we have to keep nodes in sync to return consistent results. We talk about the concepts of Linearizability, Serializability, Eventual and Causal Consistency, their guarantees and limitations.

Leader Election and Failure Detection

Many distributed databases use the concept of Leadership to have a single point of reference and make some of the decisions locally. However, both the leader and the participants may fail. We explore several Failure Detection algorithms that help us detect these failures and react to them.

Broadcast and Consensus

With Atomic Commitment, Total Broadcast, and Consensus algorithms, distributed systems can make cluster-wide decisions and communicate them to the participants while preserving strong consistency guarantees. We discuss both traditional and cutting-edge algorithms used for that.

About the Author

Alex is an Infrastructure Engineer, Apache Cassandra Committer and PMC Member, working on building data infrastructure and processing pipelines. He’s interested in algorithms, databases, distributed systems, understanding how things work, and sharing that knowledge with others through blog posts, articles, and conference talks.

Feedback, Reviews and Testimonials

I often recommend Database Internals: A Deep Dive into How Distributed Data Systems Work

by Marc Brooker, Distinguished engineer at AWS, via Twitter

I’m a big fan of @ifesdjeen book. Read it a couple of times.

by Alex Xu, author of System Design Interview via Twitter

“Designing Data Intensive Applications” gives a broad but detailed survey. “Database Internals” goes deeper on specifically the building a distributed database side.

by Alex Miller via Twitter

I finally got round to reading @ifesdjeen's Database Internals. It's a nice book! If you enjoyed @intensivedata and want more detail on certain topics (especially storage engines and consensus), it's a good follow-on read. https://databass.dev

by Martin Kleppmann, author of DDIA via Twitter

Enjoying my copy of Database Internals in NYC before heading home... (Thanks ⁦@ifesdjeen ⁩ for your most valuable signature!)

by Joran Dirk Greef, Founder and CEO of @TigerBeetleDB via Twitter

BTW if you are looking for a book on databases that you can read cover to cover, I recommend Database Internals by @ifesdjeen

by Dominik Tornow, author of Thinking in Distributed Systems via Twitter

Reviewing the Chinese version of Database Internals, so happy to offer my help on bringing this book to Chinese readers, I think it will be available in bookstores soon. @ifesdjeen

by Ed Huang, Co-founder & CTO of @PingCAP via Twitter

If you're not a stranger to the world of databases, then you have either read or heard about the Database Internals book by Alex Petrov.

by Denis Magda, author of Just use Postgres via Twitter

Diving again in some sections of @ifesdjeen's great "Database Internals" on this rainy week-end. I hope material like this makes it into classrooms at some point.

by Pierre-Yves Ritschard via Twitter

Honestly for my next book If I get it to be half as good as @ifesdjeen’s book, it’d be a win already. The work and effort shows on every page. Graphics, diagrams, correct jargon, and so on. A Distributed Systems gold mine!

by Alvaro Videla via Twitter

Finished reading @therealdatabass by @ifesdjeen on database internals. This is an unusually in-depth and precise book on data structures and algorithms from a non-academic publisher, but extremely readable and compact. I feel like I learned a lot here efficiently.

by Chris Seaton via Twitter

Reading "Database Internals" #therealdatabass and find it awesome! Strongly recommend for those whoever wants to understand how database works. Thanks @ifesdjeen for mentioning #TiDB in the "Distributed Transaction" chapter.

by Siddon Tang via Twitter

Got my copy of @therealdatabass by @ifesdjeen, started reading it online. Great book for anyone working with databases or distributed systems. Such an essential part of modern software engineering.

by Travis Sturzl via Twitter

A must on everyone's bookshelf.

by Jordi Martinez via Twitter

Have been enjoying Database Internals @therealdatabass. It has helped me see many concepts I’ve encountered here and there as a software engineer in the past years coming together and making sense.

by Oak Nauhygon via Twitter

Database Internals is fantastic for a solid framework for understanding databases.

by Alex Wise via Twitter

This is one of the best texts covering Database internals. Databases are used everyday, and understanding what happens under the hood is daunting task. This book takes a pragmatic approach on the topic, starting with basics and then taking a deeper dive into how the basic data structures and concept come together. IMHO, this book shall appeal to both Database developer's and engineer's who want to understand how databases work. This book is must have to for the engineer's who really want to get into Database development. Otherwise also this book is a must have reference in general. I personally liked the attention to details in the book on what really matter's when writing a real database. The concepts are equally applicable to SQL and NoSQL databases.

by Ashish Paliwal via Amazon review

This is an amazing book about the internal workings of a database. I highly recommend this book if you’re looking to understand how a database actually works. Building a full-featured database is a huge undertaking, but after reading this you should be able to understand how most major databases work and even build your own. It also has a great section that goes over distributed systems.

This is the only book I know of that has all of this information relevant to database design all in one place. As someone who has read a lot of the resources listed in this book (there are a ton!), it’s nice to see all of this information condensed into a single book.

by J.R. Garcia via Amazon review

I have a number of years in the IT industry (focused on security) and I always wanted to get my hands on something that would explain in depth the inner workings of databases. This book hits the nail on its head for me. It is exactly what I was looking for.

by Alex Stamate via Amazon review

Nice book to read if you want to understand how database systems work underneath. If you are a software developer/database administrator you should read this to get a mental model of physical workings (opposed to logical e.g, SQL) of a database.

by Sasi Kanth Ala via Amazon review

This book is a must read for anyone wanting to call themselves full-stack engineers.

So much of this knowledge of database internals is becoming basic required knowledge when designing distributed/microservices/event-driven platforms.

by Mck via Amazon review

Such great details on both local disk storage and distributed systems. This is a must read for anyone who uses database; really helps explain how they work

by Anonymous person via Amazon review

FAQ

Where do I get a DRM-free version of the book?

As with any O’Reilly book, you can get a DRM-free version from ebooks.com.

Is this book about NoSQL/Distributed or Traditional/Relational Databases?

This book does not dissect any single database. Instead, it takes several of them apart to understand what’s inside. B-Trees are used both in relational databases, such as PostgreSQL, and in document databases, such as MongoDB (WiredTiger). Similarly, LSM Trees are used in Apache Cassandra, and there was an attempt to add them to SQLite.

Distributed systems concepts such as Two-Phase Commit, Gossip, Leader Election, and Failure Detection are not specific to the NoSQL movement and can be (and are) used in many databases. Moreover, we are witnessing a new generation of databases that work at scale while offering rich query APIs and strong (or configurable) consistency guarantees. The book is about concepts that are seen in databases of all kinds.