Introducing KitKat Series: A Hub to Practice Big Data Projects

Mihir Dhakan
3 min read · Jun 19, 2021

Are you interested in working on Big Data projects, but don’t have sufficient resources or are confused about where to start? If yes, read on; you might find some answers.


Many times, people feel helpless at not getting the right opportunities to work on the technologies they aspire to, even though they have the right skill set. The upcoming blog series “KitKat” is my sincere attempt to address that group of people. Ok, Ok… I know you have many questions roaming around in your mind while reading this, and to address them I have written this introductory blog piece. It’s good to set expectations right at the beginning, so I will jot down common FAQs such as what to expect and what not, what the prerequisites are, et cetera. This should mentally prepare you before jumping into KitKat.

What is KitKat series?🙄

  • It is a series of technical blogs (published on Medium) containing mini Big Data (Hadoop) projects from various domains such as Healthcare, Telecom, Government, Finance, etc.

Why the name KitKat?🧐

  • It signifies short and sweet 🍫. No rocket science; I wanted the name to be relatable to its content.

What to Expect?🤑

  • First, any project starts for a reason: solving a problem, enhancing an existing process, and ultimately enabling better business decisions.
  • Hence, all my blogs will start with a problem statement or goal statement tied to a business decision.
  • We will then map out the available data sources and their origins, connect them to our data pipeline, and decide which component of the Hadoop ecosystem to use where.
  • Then we build the data pipeline hands-on. The steps for each task will be briefly explained, and the source code will also be available in my GitHub repo💻: mihirdhakan93 (Mihir Dhakan) (github.com)
  • Follow the steps in the blog or the GitHub repo and perform them on your own.
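To make the flow above concrete, here is a toy, local-only sketch of the ingest → transform → load pattern the blogs will follow. This is only a stand-in and assumes nothing from the series itself: Python’s stdlib `sqlite3` plays the role of MySQL/Hive, an in-memory CSV plays the HDFS landing zone, and the claims data and table name are made up for illustration.

```python
# A miniature data pipeline: ingest a raw source, transform it,
# and load it into a SQL store to answer a business question.
import csv
import io
import sqlite3

# 1. Ingest: a raw source file (in the blogs, this would land in HDFS).
raw_csv = io.StringIO("patient_id,amount\n1,250.0\n2,90.5\n1,40.0\n")

# 2. Transform: parse rows and cast types (a Spark/Hive job in the series).
rows = [(int(r["patient_id"]), float(r["amount"]))
        for r in csv.DictReader(raw_csv)]

# 3. Load: push into a SQL store and answer the goal statement,
#    e.g. "total claim amount per patient".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (patient_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO claims VALUES (?, ?)", rows)

totals = dict(conn.execute(
    "SELECT patient_id, SUM(amount) FROM claims GROUP BY patient_id"))
print(totals)  # {1: 290.0, 2: 90.5}
```

The real projects swap each stage for a Hadoop-ecosystem component (Sqoop/Kafka for ingest, Spark/Hive for transform, HDFS/MySQL for storage), but the shape of the pipeline stays the same.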

What not to expect?😊

  • Let’s be very honest here: it would not be as full-fledged as in industry, but you will be able to figure out the different aspects of data pipelines and how they work.

Pre-requisites🦾

  • Willingness to learn and get your hands dirty. Just by reading the blog, you won’t learn; indeed, a bitter truth.
  • Beginner-level knowledge of Hadoop, Hive, HDFS, Scala, Python, Sqoop, MySQL, etc. If you don’t fall into this group, don’t worry; it’s not going to be that tough. Give it a try, mate; better to try than never.
  • Data: I will share some open-source data repositories in each blog, which we will use throughout.
  • Cluster: I will use a Big Data cloud cluster from a provider such as CloudxLab, for a nominal charge. I am a big fan of the cloud, but you are free to run everything locally, in virtual machines, etc.

P.S.: The idea is to combine all the technologies into one single project data pipeline, rather than learning things in silos.

Which components of Hadoop will I learn hands-on?😍

  • Hive, HDFS, Kafka, Spark, Scala, Python (libraries such as Pandas), MySQL, and file storage formats such as Avro, Parquet, and ORC

OK, I’m ready. Let’s start!

  • Follow me (I mean, follow on Medium 😉) and subscribe to get notifications.
