Introducing the KitKat Series: A Hub to Practice Big Data Projects
Are you interested in working on Big Data projects, but lack the resources and are unsure where to start? If yes, read on and you might find some answers.
Many a time, people feel helpless because they do not get the right opportunities to work on the technologies they aspire to, even though they have the right skill set. The upcoming blog series “KitKat” is my sincere attempt to help such people. OK, OK… I know you have many questions in mind while reading this, and to address them I have written this introductory post. It is good to set expectations right at the beginning, so I will go through common FAQs such as what to expect and what not to, what the pre-requisites are, and so on. This should mentally prepare you before jumping into KitKat.
What is the KitKat series?🙄
- It is a series of technical blogs (published on Medium) containing mini Big Data Hadoop projects from various domains such as Healthcare, Telecom, Government, Finance, etc.
Why the name KitKat?🧐
- It signifies short and sweet 🍫. No rocket science; I wanted the name to relate to the content.
What to Expect?🤑
- First, every project starts for a reason: solving a problem, enhancing an existing process, and so on, ultimately to make better business decisions.
- Hence, all my blogs will start with a problem statement or a goal statement that enables business decisions.
- We will then map out the available data sources and their origins, connect them to our data pipeline, and decide which component of the Hadoop ecosystem to use and where.
- Next, we will build the data pipeline hands-on. The steps for each task will be explained briefly, and the source code will be available in my GitHub repo 💻: mihirdhakan93 (Mihir Dhakan) (github.com)
- Follow the steps mentioned in the blog or the GitHub repo and perform them on your own.
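The flow described above (goal statement, sources, pipeline, decision) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not code from the series: the hospital-visit records, the column names and the helpers `ingest`, `visits_per_department` and `busiest` are all hypothetical, and the Python standard library stands in for the Hadoop components the actual projects will use.

```python
import csv
import io
from collections import Counter

# Hypothetical goal statement: which department receives the most visits?
GOAL = "Find the busiest hospital department"

# Hypothetical source data; a real project would read from HDFS, MySQL, etc.
RAW_CSV = """patient_id,department
1,Cardiology
2,Oncology
3,Cardiology
4,Neurology
5,Cardiology
"""


def ingest(raw_text):
    """Source step: parse raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def visits_per_department(rows):
    """Transform step: count visits for each department."""
    return Counter(row["department"] for row in rows)


def busiest(counts):
    """Decision step: return the (department, visits) pair with the most visits."""
    return counts.most_common(1)[0]


if __name__ == "__main__":
    rows = ingest(RAW_CSV)
    counts = visits_per_department(rows)
    print(GOAL, "->", busiest(counts))
```

Each function maps to one stage of the flow, so swapping a stage for a real component (for example, reading from a Hive table instead of an in-memory string) should not disturb the others.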
What not to expect?😊
- Let’s be very honest here: the projects will not be as full-fledged as those in the industry, but you will be able to figure out the different aspects of data pipelines and how they work.
What are the pre-requisites?
- Willingness to learn and get your hands dirty. You won’t learn just by reading the blog; a bitter truth indeed.
- Beginner-level knowledge of Hadoop, Hive, HDFS, Scala, Python, Sqoop, MySQL, etc. If you don’t have this yet, don’t worry; it’s not going to be that tough. Give it a try, mate; better ‘try’ than never.
- Data: I will share some open-source data repositories in each blog, which we will use throughout.
- Cluster: I will be using a Big Data cloud cluster from a provider such as CloudxLab, available for nominal charges. I am a big fan of the cloud, but you are free to work on your local machine, virtual machines, etc.
P.S.: The idea is to combine all the technologies in one single project data pipeline, rather than learn each of them in a silo.
Which components of Hadoop will I learn hands-on?😍
- Hive, HDFS, Kafka, Spark, Scala, Python (libraries such as Pandas), MySQL, and file storage formats such as Avro, Parquet, and ORC
OK, I am ready. Let’s start!
- Follow me (I mean, follow me on Medium 😉) and subscribe to get notified when a new post is out.