Sunday, April 2, 2017

Java8 streams and lambda expression GroupBy MaxBy MinBy Example

Wednesday, March 22, 2017

How Hadoop works

Hadoop divides the given input file into small parts to increase parallel processing. It uses its own file system called HDFS. Each spitted file is assigned to the mapper which works on the same physical machine with the given chunk. 

Mappers are processing small file chunks and passing their processing results to context.Mappers are processing splitted files (each chunk {piece of the main file} size = HDFS block size) line by line in the map function .

Hadoop supports different programming languages so it uses its own serilization/deseriliazation mechanism. That why you see IntWritable, LongWritable,etc types in the examples. You can write your own Writable classess by implementing the Writable interface according to your requirements.

Hadoop collects all different outputs of the mappers and sort them by KEY and forwards these results to Reducers.

"Book says all values with same key will go to same reducer"

map (Key inputKey, Value inputValue, Key outputKey, Value outputValue)

reduce (Key inputKeyFromMapper, Value inputValueFromMapper, Key outputKey, Value output value)

Hadoop calls reduce function for the each line of given file.

And finally writes the result of reducers to the HDFS file system.

See the WordCount example for better understanding : hadoop-wordcount-example

Friday, March 10, 2017

Kafka Basics, Producer, Consumer, Partitions, Topic, Offset, Messages

Kafka is a distributed system that runs on a cluster with many computers. Producers are the programs that feeds kafka brokers. A broker is a kafka server which stores/keeps/maintains incoming messages in files with offsets. Each message is stored in a file with an index , actually this index is an offset. Consumers are the programs which consumes the given data with offsets. Kafka does not deletes consumed messages with its default settings. Messages stays still persistent for 7 days with default configuration of Kafka.

Topics are the contracts between the producer and consumers. Producer pushes messages to a specific topic and consumers consumes messages from that specific topic. There can be many different topics and Consumer Groups which are consuming data from a specific topic according to your business requirements.

If topic is very large (larger than the storage capacity of a computer) then kafka can use other computers as a cluster node to store that large topic. Those nodes are called as partition in kafka terminology. So each topic may be divided in many partitions (computers) in the Kafka Cluster.

So how kafka make decisions on storing and dividing large topics in to the partitions ? Answer : It does not !

So you should make calculations according to your producer/consumer load, message size and storage capacity. This concept is called as partitioning. Kafka expect you to implement Partitioner interface. Of course there is a default partitioner (SimplePartitioner implements Partitioner).

As a result; A topic is divided into many partitions and each partition has its own OffSet and each consumer reads from different partitions but many consumers can not read from same partition to prevent duplicated reads.

If you need to collect specific messages on the specific partition you should use message keys. Same keys are always collected in the same partition. By the way those messages are consumed by the same Consumer.

Topic partitioning is the key for parallelism in Kafka. On both the producer and the broker side, writes to different partitions can be done fully in parallel. So it is important to have much partitions for a better Performance !

Please see the below use case which shows semantics of a kafka server for a  game application's events for some specifics business requirements. All the messages related to a specific game player is collected on the particular partition. UserIds are used as message keys here.

I tried to briefly summarize it... I am planning to add more post about  data pipelining with: 
 Hadoop->Kafka->Spark | Storm  etc..


Images are taken from: Learning Journal and Cloudera 

Monday, March 6, 2017

Scalable System Design Patterns

1) Load Balancer : 

2) Scatter and Gather: 

3) Result Cache :

4) Shared Space

5) Pipe and Filter

6) Map Reduce

7)Bulk Synchronous Parellel

8)Execution Orchestrator

Friday, March 3, 2017


Client Libraries:

Data Types

1) Strings : BinarySafe: MAX 512 MB, you can even save images etc..
2) Set : Unique list of elements. Useful when operation on two different key with : intersect, union, difference

3) SortedSet

4) List
5) Hash : Useful for storing objects with fields. name, surname etc...

Pipelining : Less blocking client code can be achived by means of it : Send multiple request and get one answer as result. Consider: split them to multiple pipelines to reduce memory requirements on server side.using pipelining Redis running on an average Linux system can deliver even 500k requests per second.Is not it enough ? 

Pub/Sub: implements