Understanding Stream Processing And Apache Kafka
Torrenting, as it is referred to when watching shows, involves downloading the entire mp4 file and watching it locally. In contrast, streaming means that you are watching the show as the packets arrive. Stream processing is then the act of processing a continuous flow of incoming data.
Suppose we were in charge of a social media platform. In the traditional approach, we’d store all of our user data in some kind of a persistent store such as a relational database. Whenever one of the users updates their status, the value in the database is overwritten. Event sourcing says that this isn’t a good way to store data because we might want to use the previous value for business analytics. Instead, we should individually record every change that happens to the database. Referring back to our example, every time the user changed their status, we’d create a separate entry, instead of overwriting the previous value. Whenever the database receives a request to read the data, we return the most up to date value.
Raw events are in a form that’s ideal for writes since you don’t need to search for the entry and update it. Rather, you can append the event to the end of a log. On the other hand, the aggregated data is in the form that’s ideal for reads. If a user is looking at someone’s status, they’re not interested in the entire history of modifications that led to the current state. Therefore, it makes sense to separate the way you write to a system from the way you read from it. We’ll get back to this.
Now, imagine that a user tries to view their friends list. In order to present this information to the user, we’d have to perform a read query on the database. Say the number of active users went from 10,000 to 100,000. This means that we have to store the information corresponding to every one of those additional 90,000 users into our database. Some very clever people have come up with search algorithms that can achieve performance on the order of O(log n). However, the time taken to search for some value still goes up as the number of entries increases. Therefore, achieving a consistently low latency as you scale, requires that you make use of caches. A cache will only hold a subset of the total data. Thus, queries take less time since they don’t have to traverse as many elements.
Caching and other forms of redundant data such as indexing are often essential for achieving good performance on reads. However, keeping the data synchronized between the different systems becomes a real challenge. In the dual writes approach, it’s the responsibility of the application code to update the data in all the appropriate places.
Suppose that we have an application with which users can send each other messages. When a new message is sent, we want to do two things:
- Add the message to the user’s inbox
- Increment the user’s count of unread messages
We keep a separate counter because we display it in the user interface all the time and it would be too slow to query the number of unread messages by scanning over the list of message every time you need to display the number. The count is derived from the actual messages in the inbox. Therefore, whenever the number of unread messages changes, we need to update the counter accordingly.
Suppose we had the following scenario:
The client sends a message to another user, inserting it into the recipient’s inbox. Then, the client makes a request to increment the unread counter. However, just at that moment, something goes wrong, maybe the database goes down, or a process crashes, or the network is interrupted. Regardless of the reason, the update to the unread counter fails.
Now, our database is inconsistent. We wouldn’t necessarily know which counters were inconsistent with what inboxes. Therefore, the only way to fix it would be to periodically recompute the values which can be very costly if we have a lot of users.
Even if everything was operating smoothly, we can still end up with problems like race conditions.
Suppose we had the following scenario:
The value of x in the first datastore is set to y by the first client and then set to z by the second client.
In the second datastore, the requests arrive in a different order, the value is first set to z then y.
Now, the two datastores are inconsistent once again.
Distributed Streaming Platform
Often times we want to use the application data for some other purpose such as training machine learning models or visualizing trends. However, we can’t just go and read values from the database because that would take up resources that could otherwise be spent handling user requests for data. As a result, we must replicate the data elsewhere which inevitably leads to the same issues discussed earlier.
How then do we get a copy of the same data in several different systems and keep them consistently synchronized as the data changes? There are multiple possible solutions but in this article, we’ll cover the distributed streaming platform Apache Kafka.
In most applications, logs are an implementation detail hidden from the user. For example, most relational databases use a write ahead log. In essence, any time the database must make changes to the underlying data, it appends those changes to the write ahead log. In the event, the database crashes while performing a write operation, it can reset to a known state.
In Kafka, the log is made a first class citizen. In other words, the main storage mechanism is the log itself. In order to make Kafka horizontally scalable, the log is split into partitions. Each partition is stored on disk and replicated across several machines, so that it can tolerate machine failures without data loss.
Referring back to our example of the social media application, any changes made by user would be sent to the backend server. The server can then append those changes as events to the log. The databases, indexes, caches and all other storage systems can be constructed by reading sequentially from the log.
Recall how, in the context of event sourcing, logs are ideal for writes because they only have to append the data to the end and databases are ideal for reads because they only contain the current state.
Kafka is able to address the problem of race conditions because it enforces strict ordering of events. Every consumer starts at the beginning of the log and reads every event, updating its contents until it is caught up. In the event the database crashes, it is able to pick up where it left off and continue to read from the log until it is caught up.
Stream processing refers to how we handle new incoming data. Kafka is a open source distributed streaming platform that provides a mechanism for reliably getting data across different storage systems.