Designing Data-Intensive Applications

by Martin Kleppmann

30 popular highlights from this book

Key Insights & Memorable Quotes

Below are the most popular and impactful highlights and quotes from Designing Data-Intensive Applications:

“In distributed systems, suspicion, pessimism, and paranoia pay off.”
“The moral of the story is that a NoSQL system may find itself accidentally reinventing SQL, albeit in disguise.”
“data outlives code.”
“A database is just a tool: how you use it is up to you.”
“The fastest and most reliable network request is no network request at all!”
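That line is about caching: if a value is already on hand locally, the request never has to cross the network. A minimal sketch of the idea, where `fetch_remote` is a hypothetical stand-in for a real network call:

```python
# A minimal sketch of the principle: answer from a local cache when possible,
# so repeated requests never touch the network. `fetch_remote` is a
# hypothetical stand-in for a real network call.
_cache: dict[str, bytes] = {}

def fetch_remote(key: str) -> bytes:
    return f"value-for-{key}".encode()   # pretend this crossed the network

def get(key: str) -> bytes:
    if key in _cache:                    # hit: no network request at all
        return _cache[key]
    value = fetch_remote(key)            # miss: pay the round trip once
    _cache[key] = value
    return value

get("user:42")   # first call goes to the network
get("user:42")   # second call is served locally
```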
“The need for data integration often only becomes apparent if you zoom out and consider the dataflows across an entire organization.”
“Pop culture is all about identity and feeling like you’re participating. It has nothing to do with cooperation, the past or the future — it’s living in the present. I think the same is true of most people who write code for money.”
“Technology is a powerful force in our society. Data, software, and communication can be used for bad: to entrench unfair power structures, to undermine human rights, and to protect vested interests. But they can also be used for good: to make underrepresented people’s voices heard, to create opportunities for everyone, and to avert disasters. This book is dedicated to everyone working toward the good.”
“Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd: “The PageRank Citation Ranking: Bringing Order to the Web,” Stanford InfoLab Technical Report 422, 1999.”
“To Schema on Read or to Schema on Write, That Is the Hadoop Data Lake Question”
“Martin Kleppmann: “Rethinking Caching in Web Apps,””
“The Google File System”
“it is poor civic hygiene to install technologies that could someday facilitate a police state”
“If we want the future to be better than the past, moral imagination is required, and that’s something only humans can provide [87]. Data and models should be our tools, not our masters.”
“How do we make our systems reliable, in spite of unreliable humans? The best systems combine several approaches:

- Design systems in a way that minimizes opportunities for error. For example, well-designed abstractions, APIs, and admin interfaces make it easy to do “the right thing” and discourage “the wrong thing.” However, if the interfaces are too restrictive people will work around them, negating their benefit, so this is a tricky balance to get right.
- Decouple the places where people make the most mistakes from the places where they can cause failures. In particular, provide fully featured non-production sandbox environments where people can explore and experiment safely, using real data, without affecting real users.
- Test thoroughly at all levels, from unit tests to whole-system integration tests and manual tests [3]. Automated testing is widely used, well understood, and especially valuable for covering corner cases that rarely arise in normal operation.
- Allow quick and easy recovery from human errors, to minimize the impact in the case of a failure. For example, make it fast to roll back configuration changes, roll out new code gradually (so that any unexpected bugs affect only a small subset of users), and provide tools to recompute data (in case it turns out that the old computation was incorrect).
- Set up detailed and clear monitoring, such as performance metrics and error rates. In other engineering disciplines this is referred to as telemetry. (Once a rocket has left the ground, telemetry is essential for tracking what is happening, and for understanding failures [14].) Monitoring can show us early warning signals and allow us to check whether any assumptions or constraints are being violated. When a problem occurs, metrics can be invaluable in diagnosing the issue.
- Implement good management practices and training—a complex and important aspect, and beyond the scope of this book.”
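One item from that list, gradual rollout, is easy to sketch. A minimal illustration, assuming a hash-based bucketing scheme (the function name, bucket count, and hash choice are ours, not the book's):

```python
# A minimal sketch of rolling out new code gradually so that unexpected bugs
# affect only a small subset of users. The hashing scheme is an illustrative
# assumption, not a method prescribed by the book.
import hashlib

def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Place a user in a stable bucket in [0, 100); users below `percent`
    get the new code path. Raising `percent` keeps earlier users included."""
    digest = hashlib.md5(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent

# Start at 1%, watch error-rate metrics, then ratchet the percentage up;
# rolling back is as fast as setting `percent` to 0.
new_path = in_rollout("u42", "new-timeline", percent=1.0)
```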
“we can understand reliability as meaning, roughly, “continuing to work correctly, even when things go wrong.” The things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant or resilient.”
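A minimal sketch of one classic fault-tolerance technique, retrying a transient fault with exponential backoff (the exception type, attempt count, and delays are illustrative assumptions):

```python
# A minimal sketch of fault tolerance via retries: a transient fault is
# hidden from the caller, while a persistent fault still surfaces after the
# final attempt. Exception type, attempts, and delays are assumptions.
import time

def with_retries(operation, max_attempts: int = 4, base_delay: float = 0.1):
    """Call `operation`; retry transient failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except IOError:
            if attempt == max_attempts - 1:
                raise                              # fault persisted: surface it
            time.sleep(base_delay * 2 ** attempt)  # back off, then try again
```

A caller that wraps a flaky network read in `with_retries` sees a normal result for faults that clear within a few attempts, which is exactly the "coping" the quote describes.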
“If you are designing a data system or service, a lot of tricky questions arise. How do you ensure that the data remains correct and complete, even when things go wrong internally? How do you provide consistently good performance to clients, even when parts of your system are degraded? How do you scale to handle an increase in load? What does a good API for the service look like?”
“We call an application data-intensive if data is its primary challenge—the quantity of data, the complexity of data, or the speed at which it is changing—as opposed to compute-intensive, where CPU cycles are the bottleneck.”
“We also discussed the interaction between partitioning and secondary indexes. A secondary index also needs to be partitioned, and there are two methods:

- Document-partitioned indexes (local indexes), where the secondary indexes are stored in the same partition as the primary key and value. This means that only a single partition needs to be updated on write, but a read of the secondary index requires a scatter/gather across all partitions.
- Term-partitioned indexes (global indexes), where the secondary indexes are partitioned separately, using the indexed values. An entry in the secondary index may include records from all partitions of the primary key. When a document is written, several partitions of the secondary index need to be updated; however, a read can be served from a single partition.”
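A minimal sketch contrasting the two methods, using plain in-memory dicts as stand-ins for partitions (the layout and function names are illustrative, not any particular database's API):

```python
# Document-partitioned (local) index: each partition indexes only its own
# documents, so writes touch one partition but reads must scatter/gather.
def query_local_index(partitions, field, value):
    results = []
    for part in partitions:                       # scatter to every partition
        results.extend(part.get((field, value), []))
    return results                                # gather the partial results

# Term-partitioned (global) index: the index is partitioned by indexed value,
# so writes may touch several partitions but a read goes to exactly one.
def query_global_index(index_partitions, field, value):
    part = index_partitions[hash((field, value)) % len(index_partitions)]
    return part.get((field, value), [])           # single-partition read

# Demo: the same two red documents, indexed both ways.
local = [{("color", "red"): ["doc1"]}, {("color", "red"): ["doc7"]}]
print(query_local_index(local, "color", "red"))   # ['doc1', 'doc7']

global_parts = [{}, {}, {}]
for doc_id in ["doc1", "doc7"]:                   # write: route by term
    part = global_parts[hash(("color", "red")) % len(global_parts)]
    part.setdefault(("color", "red"), []).append(doc_id)
print(query_global_index(global_parts, "color", "red"))  # ['doc1', 'doc7']
```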
“hidden assumptions,”
“Reliability means making systems work correctly, even when faults occur. Faults can be in hardware (typically random and uncorrelated), software (bugs are typically systematic and hard to deal with), and humans (who inevitably make mistakes from time to time). Fault-tolerance techniques can hide certain types of faults from the end user.

Scalability means having strategies for keeping performance good, even when load increases. In order to discuss scalability, we first need ways of describing load and performance quantitatively. We briefly looked at Twitter’s home timelines as an example of describing load, and response time percentiles as a way of measuring performance. In a scalable system, you can add processing capacity in order to remain reliable under high load.

Maintainability has many facets, but in essence it’s about making life better for the engineering and operations teams who need to work with the system. Good abstractions can help reduce complexity and make the system easier to modify and adapt for new use cases. Good operability means having good visibility into the system’s health, and having effective ways of managing it.

There is unfortunately no easy fix for making applications reliable, scalable, or maintainable. However, there are certain patterns and techniques that keep reappearing in different kinds of applications. In the next few chapters we will take a look at some examples of data systems and analyze how they work toward those goals.”
“Making a system simpler does not necessarily mean reducing its functionality; it can also mean removing accidental complexity. Moseley and Marks [32] define complexity as accidental if it is not inherent in the problem that the software solves (as seen by the users) but arises only from the implementation.”
“It is well known that the majority of the cost of software is not in its initial development, but in its ongoing maintenance—fixing bugs, keeping its systems operational, investigating failures, adapting it to new platforms, modifying it for new use cases, repaying technical debt, and adding new features.”
“The architecture of systems that operate at large scale is usually highly specific to the application—there is no such thing as a generic, one-size-fits-all scalable architecture (informally known as magic scaling sauce). The problem may be the volume of reads, the volume of writes, the volume of data to store, the complexity of the data, the response time requirements, the access patterns, or (usually) some mixture of all of these plus many more issues.”
“When generating load artificially in order to test the scalability of a system, the load-generating client needs to keep sending requests independently of the response time. If the client waits for the previous request to complete before sending the next one, that behavior has the effect of artificially keeping the queues shorter in the test than they would be in reality, which skews the measurements [23].”
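In other words, a realistic load generator is open-loop: it fires requests on a fixed schedule regardless of how slow the responses are. A minimal asyncio sketch of that behavior, with a simulated server standing in for real requests:

```python
# A minimal sketch of an open-loop load generator: requests leave on a fixed
# schedule, never waiting for earlier responses, so queues build up as they
# would in reality. `send_request` simulates a slow server and is a
# stand-in for a real client call.
import asyncio, random, time

async def send_request(latencies):
    start = time.monotonic()
    await asyncio.sleep(random.uniform(0.01, 0.5))   # simulated server delay
    latencies.append(time.monotonic() - start)

async def open_loop_load(rate_per_sec: float, duration_sec: float):
    latencies, tasks = [], []
    deadline = time.monotonic() + duration_sec
    while time.monotonic() < deadline:
        tasks.append(asyncio.create_task(send_request(latencies)))
        await asyncio.sleep(1 / rate_per_sec)   # next request on schedule,
                                                # regardless of response times
    await asyncio.gather(*tasks)
    return latencies

lat = asyncio.run(open_loop_load(rate_per_sec=50, duration_sec=2))
print(f"sent {len(lat)} requests")
```

A closed-loop client would instead `await` each request before sending the next, which is precisely the queue-shortening mistake the quote warns about.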
“In practice, in a system handling a variety of requests, the response time can vary a lot. We therefore need to think of response time not as a single number, but as a distribution of values that you can measure.”
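The usual way to summarize such a distribution is with percentiles (p50, p95, p99). A minimal sketch using the nearest-rank method on made-up sample data:

```python
# A minimal sketch of summarizing a response-time distribution by percentiles
# rather than a single average. Nearest-rank method; the samples are made up.
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value >= p% of the samples."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

response_times_ms = [12, 15, 14, 13, 250, 16, 12, 11, 900, 14]
for p in (50, 95, 99):
    print(f"p{p} = {percentile(response_times_ms, p)} ms")
# The median (p50) is 14 ms even though two slow outliers push the mean past
# 125 ms, which is exactly why a single number misleads.
```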
“let’s consider Twitter as an example, using data published in November 2012 [16]. Two of Twitter’s main operations are:

- Post tweet: A user can publish a new message to their followers (4.6k requests/sec on average, over 12k requests/sec at peak).
- Home timeline: A user can view tweets posted by the people they follow (300k requests/sec).”
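A quick back-of-the-envelope on why these numbers are interesting together: if each tweet is pushed into every follower's home timeline, the write rate multiplies by the average fan-out. The follower figure below is an assumption for illustration; only the request rates come from the quote:

```python
# Back-of-the-envelope arithmetic implied by the numbers above. The request
# rates are the quoted ones; the average follower count is an assumption
# made here purely for illustration.
tweets_per_sec_avg = 4_600
tweets_per_sec_peak = 12_000
assumed_avg_followers = 75   # assumption, not a figure from the quote

# If each new tweet is pushed into every follower's home timeline, the
# timeline write rate is the tweet rate multiplied by the fan-out:
print(tweets_per_sec_avg * assumed_avg_followers)    # 345,000 writes/sec
print(tweets_per_sec_peak * assumed_avg_followers)   # 900,000 writes/sec
```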
“Scalability is the term we use to describe a system’s ability to cope with increased load.”
