Introducing KitKat Series: A Hub to Practice Big Data Projects

Mihir Dhakan
3 min read · Jun 19, 2021

Are you interested in working on Big Data projects, but don’t have sufficient resources or are confused about where to start? If yes, read on; you might find some answers.


Many times, people feel helpless at not getting the right opportunities to work on the technologies they aspire to, even though they have the right skill set. The upcoming blog series “KitKat” is my sincere attempt to address that group of people. Ok, Ok… I know you have many questions roaming around in your mind while reading this, and to address them I have written this introductory blog piece. It’s good to set expectations right at the beginning, so I will jot down common FAQs such as what to expect and what not, what the prerequisites are, et cetera. This should mentally prepare you before jumping into KitKat.

What is KitKat series?🙄

  • It is a series of technical blogs (published on Medium) containing mini Big Data (Hadoop) projects from various domains such as Healthcare, Telecom, Government, Finance, etc.

Why the name KitKat?🧐

  • It signifies short and sweet 🍫. No rocket science; I wanted the name to be relatable to its content.

What to Expect?🤑

  • First, any project starts for a reason: solving a problem, enhancing an existing process, and ultimately enabling better business decisions.
  • Hence, all my blogs will start with a problem statement or goal statement tied to a business decision.
  • We will then map out the available data sources and their origins, connect them to our data pipeline, and decide which component of the Hadoop ecosystem to use where.
  • Then we build the data pipeline hands-on. The steps for each task will be briefly explained, and the source code will also be available in my GitHub repo💻: mihirdhakan93 (Mihir Dhakan) (github.com)
  • Follow the steps in the blog or the GitHub repo and perform them on your own.
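To make the flow above concrete, here is a toy, local-only sketch of the ingest → transform → load pattern the blogs will follow. This is only a stand-in and assumes nothing from the series itself: Python’s stdlib `sqlite3` plays the role of MySQL/Hive, an in-memory CSV plays the HDFS landing zone, and the claims data and table name are made up for illustration.

```python
# A miniature data pipeline: ingest a raw source, transform it,
# and load it into a SQL store to answer a business question.
import csv
import io
import sqlite3

# 1. Ingest: a raw source file (in the blogs, this would land in HDFS).
raw_csv = io.StringIO("patient_id,amount\n1,250.0\n2,90.5\n1,40.0\n")

# 2. Transform: parse rows and cast types (a Spark/Hive job in the series).
rows = [(int(r["patient_id"]), float(r["amount"]))
        for r in csv.DictReader(raw_csv)]

# 3. Load: push into a SQL store and answer the goal statement,
#    e.g. "total claim amount per patient".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (patient_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO claims VALUES (?, ?)", rows)

totals = dict(conn.execute(
    "SELECT patient_id, SUM(amount) FROM claims GROUP BY patient_id"))
print(totals)  # {1: 290.0, 2: 90.5}
```

The real projects swap each stage for a Hadoop-ecosystem component (Sqoop/Kafka for ingest, Spark/Hive for transform, HDFS/MySQL for storage), but the shape of the pipeline stays the same.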

What not to expect?😊

  • Let’s be very honest here: it would not be as full-fledged as in industry, but you will be able to figure out the different aspects of data pipelines and how they work.

Pre-requisites🦾

  • Willingness to learn and get your hands dirty. Just by reading the blog, you won’t learn; indeed, a bitter truth.
  • Beginner-level knowledge of Hadoop, Hive, HDFS, Scala, Python, Sqoop, MySQL, etc. If you don’t fall into this group, don’t worry; it’s not going to be that tough. Give it a try, mate; better to try than never.
  • Data: I will share some open-source data repositories in each blog, which we will use throughout.
  • Cluster: I will use a Big Data cloud cluster from a provider such as CloudxLab, for a nominal charge. I am a big fan of the cloud, but you are free to run everything locally, in virtual machines, etc.

P.S.: The idea is to combine all the technologies into one single project data pipeline, rather than learning things in silos.

Which components of Hadoop will I learn hands-on?😍

  • Hive, HDFS, Kafka, Spark, Scala, Python (libraries such as Pandas), MySQL, and file storage formats such as Avro, Parquet, and ORC

OK, I’m ready. Let’s start!

  • Follow me (I mean, follow on Medium 😉) and subscribe to get notifications.
