Kafka For Beginners (Part 1)
In this post “Kafka For Beginners” you can find detailed information about Kafka.
What is Kafka?
Apache Kafka is a distributed publish-subscribe messaging system and a robust queue, written in Scala and Java, that can handle a high volume of data and lets you pass messages from one endpoint to another. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Its storage layer is essentially a “massively scalable pub/sub message queue designed as a distributed transaction log,” making it highly valuable for enterprise infrastructures that process streaming data. Kafka is suitable for both offline and online message consumption. Kafka messages are persisted on disk and replicated within the cluster to prevent data loss. Kafka has traditionally been built on top of the ZooKeeper synchronization service, although newer releases can run without it in KRaft mode. It integrates very well with Apache Storm and Spark for real-time streaming data analysis.
Moreover, Kafka can replace conventional message brokers, such as those based on JMS or AMQP, while offering higher throughput, reliability, and replication. Its core abstractions are the Kafka broker, the Kafka Producer, and the Kafka Consumer.
A Kafka broker is a node in the Kafka cluster; its job is to persist and replicate the data. A Kafka Producer pushes messages into a message container called a Kafka Topic, and a Kafka Consumer pulls messages from the Kafka Topic.
Messaging System in Kafka
We use a messaging system when we transfer data from one application to another, so that applications can focus on the data itself without worrying about how to share it. Distributed messaging is based on the concept of reliable message queuing: messages are queued asynchronously between client applications and the messaging system. Two messaging patterns are available: point-to-point and publish-subscribe (pub-sub). Most messaging systems follow the pub-sub pattern.
- Point to Point Messaging System
Here, messages are persisted in a queue. One or more consumers can read from the queue, but a particular message can be consumed by at most one of them. As soon as a consumer reads a message from the queue, it disappears from that queue.
- Publish-Subscribe Messaging System
Here, messages are persisted in a topic. In Kafka, consumers can subscribe to one or more topics and consume all the messages in those topics. In this model, message producers are called publishers and message consumers are called subscribers.
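To see the difference between the two patterns, here is a minimal in-memory sketch. It is a toy model only, not Kafka code: the class names `PointToPointQueue` and `PubSubTopic`, and the subscriber names, are made up for this illustration.

```python
from collections import deque


class PointToPointQueue:
    """Toy queue (hypothetical): each message goes to at most one consumer."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Reading removes the message, so no other consumer can see it.
        return self._messages.popleft() if self._messages else None


class PubSubTopic:
    """Toy topic (hypothetical): every subscriber sees every message."""

    def __init__(self):
        self._messages = []
        self._positions = {}  # subscriber name -> index of next unread message

    def publish(self, message):
        self._messages.append(message)

    def subscribe(self, name):
        self._positions[name] = 0

    def poll(self, name):
        # Return everything this subscriber has not read yet.
        start = self._positions[name]
        self._positions[name] = len(self._messages)
        return self._messages[start:]


# Point to point: the second reader finds the queue empty.
queue = PointToPointQueue()
queue.send("order-1")
first_reader = queue.receive()
second_reader = queue.receive()

# Pub-sub: both subscribers receive the same message.
topic = PubSubTopic()
topic.subscribe("billing")
topic.subscribe("shipping")
topic.publish("order-1")
billing_view = topic.poll("billing")
shipping_view = topic.poll("shipping")
```

In the queue, reading a message removes it. In the topic, messages stay in place and each subscriber only moves its own read position.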
Why use Apache Kafka
Following are a few benefits of Kafka −
- Reliability − Kafka is distributed, partitioned, replicated, and fault tolerant.
- Scalability − The Kafka messaging system scales easily without downtime.
- Durability − Kafka uses a distributed commit log, which means messages are persisted to disk as fast as possible; hence it is durable.
- Performance − Kafka has high throughput for both publishing and subscribing messages. It maintains stable performance even when many TB of messages are stored.
- Kafka is very fast and, with suitable replication settings, is designed to avoid downtime and data loss.
Kafka can be used in many use cases. Some of them are listed below −
- Metrics − Kafka is often used for operational monitoring data. This involves aggregating statistics from distributed applications to produce centralized feeds of operational data.
- Log Aggregation Solution − Kafka can be used across an organization to collect logs from multiple services and make them available in a standard format to multiple consumers.
- Stream Processing − Popular frameworks such as Storm and Spark Streaming read data from a topic, process it, and write the processed data to a new topic where it becomes available for users and applications. Kafka’s strong durability is also very useful in the context of stream processing.
Kafka Tutorial — Prerequisites
You should have a good understanding of Java, Scala, distributed messaging systems, and the Linux environment before proceeding with this Apache Kafka tutorial.
A streaming platform has three key capabilities:
- Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
- Store streams of records in a fault-tolerant durable way.
- Process streams of records as they occur.
Kafka is generally used for two broad classes of applications:
- Building real-time streaming data pipelines that reliably get data between systems or applications
- Building real-time streaming applications that transform or react to the streams of data
To understand how Kafka does these things, let’s dive in and explore Kafka’s capabilities from the bottom up.
First a few concepts:
- Apache Kafka is run as a cluster on one or more servers that can span multiple datacenters.
- The Apache Kafka cluster stores streams of records in categories called topics.
- Each record consists of a key, a value, and a timestamp.
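As a rough picture of what a record looks like, here is a small sketch. The `Record` class and its field names are hypothetical, chosen for this post; they are not the types used by any Kafka client library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for a Kafka record: key, value, timestamp."""

    key: bytes
    value: bytes
    timestamp_ms: int  # assumed here to be milliseconds since the Unix epoch


# An example record; the key and payload are invented for illustration.
record = Record(
    key=b"user-42",
    value=b'{"action": "login"}',
    timestamp_ms=1_700_000_000_000,
)
```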
Kafka has four core APIs:
- The Producer API allows an application to publish a stream of records to one or more Kafka topics.
- The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
- The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
- The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
Components are:
- Topic − A topic is essentially a collection of messages; it is how Kafka stores and organizes messages across its system. Topics can be replicated (copied) and partitioned (divided), and you can visualize them as logs in which Kafka stores messages. This ability to replicate and partition topics is one of the factors that enable Kafka’s fault tolerance and scalability.
- Producer − Publishes messages to a Kafka topic.
- Consumer − Subscribes to one or more topics, then reads and processes messages from them.
- Broker − Manages the storage of messages in the topics. If Kafka has more than one broker, that is what we call a Kafka cluster.
- ZooKeeper − Kafka uses ZooKeeper to provide the brokers with metadata about the processes running in the system and to facilitate health checking and broker leadership election.
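To make the idea of partitioning a topic more concrete, here is a small sketch that spreads records over partitions by hashing the record key. This is an assumption made for illustration: the partition count, the function name `choose_partition`, and the use of CRC32 are all stand-ins, and this is not Kafka's actual partitioner.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical partition count for one topic


def choose_partition(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    # CRC32 is only a stand-in hash for this sketch.
    return zlib.crc32(key) % num_partitions


# Each partition is modeled as an append-only list of (key, value) pairs.
partitions = defaultdict(list)
events = [(b"user-1", "login"), (b"user-2", "login"), (b"user-1", "logout")]
for key, value in events:
    partitions[choose_partition(key)].append((key, value))
```

Because the same key always hashes to the same partition, all events for `user-1` end up in one log, in the order they were written.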
In this Kafka tutorial we view partitions as logs. A data source writes messages to the log, and at any time one or more consumers can read from the log they select. The data source appends to the log at one end while the consumers read from it at different offsets.
Kafka retains messages for a considerable amount of time, so consumers can read them at their convenience. However, if Kafka is configured to keep messages for 24 hours and a consumer is down for longer than 24 hours, the consumer will lose messages. If the consumer is down for just 60 minutes, it can resume reading from its last known offset. Kafka doesn’t keep state on which messages each consumer has read from a topic; a consumer’s position is simply its offset.
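The retention example above can be played through with a toy log. This is a simplified sketch, not Kafka's storage engine: the `PartitionLog` class, its method names, and the explicit `now` arguments (seconds on an imaginary clock) are all assumptions made for the illustration.

```python
HOUR = 3600


class PartitionLog:
    """Toy append-only log (hypothetical) with time-based retention."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self._entries = []  # list of (offset, written_at, message)
        self._next_offset = 0

    def append(self, message, now):
        offset = self._next_offset
        self._entries.append((offset, now, message))
        self._next_offset += 1
        return offset

    def expire(self, now):
        # Drop every message older than the retention window.
        cutoff = now - self.retention_seconds
        self._entries = [e for e in self._entries if e[1] >= cutoff]

    def read_from(self, offset):
        # A consumer resumes from the offset it remembers.
        return [(o, m) for o, _, m in self._entries if o >= offset]


log = PartitionLog(retention_seconds=24 * HOUR)

# The consumer reads offset 0, then goes down for 60 minutes.
log.append("m0", now=0)
log.append("m1", now=1800)
log.expire(now=1 * HOUR)
after_short_outage = log.read_from(1)

# The consumer (next offset: 2) goes down again, this time for 30 hours.
log.append("m2", now=2 * HOUR)
log.append("m3", now=30 * HOUR)
log.expire(now=31 * HOUR)
after_long_outage = log.read_from(2)
```

After the short outage the consumer picks up `m1` from its last known offset. After the long outage `m2` has already expired, so the consumer only sees `m3`.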
Partition in Kafka
Every Kafka broker hosts a number of partitions, and each of them is either the leader or a replica for a partition of a topic. The leader is responsible for all writes and reads for that partition, and it also updates the replicas with new data. If the leader fails, a replica takes over as the new leader.
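Here is a minimal sketch of the leader and replica idea. It makes two simplifying assumptions that are not taken from Kafka itself: replicas are updated synchronously on every write, and the new leader is simply the first surviving broker. The `ReplicatedPartition` class and the broker IDs are hypothetical.

```python
class ReplicatedPartition:
    """Toy model (hypothetical) of one partition replicated across brokers."""

    def __init__(self, broker_ids):
        self.replicas = {broker_id: [] for broker_id in broker_ids}
        self.leader = broker_ids[0]

    def write(self, message):
        # The leader appends first, then copies the message to each replica.
        self.replicas[self.leader].append(message)
        for broker_id, log in self.replicas.items():
            if broker_id != self.leader:
                log.append(message)

    def read(self):
        # Reads are served from the leader's copy.
        return list(self.replicas[self.leader])

    def fail_broker(self, broker_id):
        del self.replicas[broker_id]
        if broker_id == self.leader:
            if not self.replicas:
                raise RuntimeError("no replica left to take over")
            # Simplification: promote the first surviving replica.
            self.leader = next(iter(self.replicas))


partition = ReplicatedPartition([1, 2, 3])
partition.write("a")
partition.write("b")
partition.fail_broker(1)  # the leader goes down
new_leader = partition.leader
data_after_failover = partition.read()
```

Since every replica already holds a copy of the data, the new leader can keep serving reads and writes without losing messages.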
In Apache Kafka, the communication between the clients and the servers is done with a simple, high-performance, language-agnostic TCP protocol. This protocol is versioned and maintains backwards compatibility with older versions. The Apache Kafka project provides a Java client, but clients are available in many languages.
I hope you liked this post. Do you have any questions? Leave a comment down below! Thanks for reading. If you liked this post, you will probably like my next ones, so please support me by subscribing to my blog.