KitKat 101: Consumer Expenditure Behavior Analysis
Welcome to KitKat 101. In this blog we will solve a problem using the data available to us, with the help of big data components such as Hive, HDFS, and Sqoop.
First, if you have not seen the introductory blog of the KitKat series, please head to https://mihirdhakan.medium.com/introducing-kitkat-series-a-hub-to-practice-big-data-projects-aa782dbbdfb1
Otherwise, keep scrolling and reading 😀
Project Name : Consumer Expenditure Behavior Analysis
Difficulty Level : Beginner👶
Components used : MySQL, Sqoop, HDFS, Hive, HQL
Data Domain: Government (New Zealand)
EDI (Early Data Inventory): The data available to us spans January 2007 to May 2021 and contains the below fields in CSV format.
Series_reference: A 13-digit reference number based on the category of expense
Period: Year and month (YYYY.MM) in which the transaction(s) took place
Data_value: Transacted amount in dollars
Suppressed: Y/N flag field; not of much importance
STATUS: possible values are R, F, C; not of much importance
UNITS: Currency measurement unit
Magnitude: not of much importance
Subject: Static value, “Electronic Transaction…”
Group: Static value, “Private Values…”
Series_title_1: possible values are Adjusted and Actual
Series_title_2: Type of expense, such as accommodation, supermarket, etc.
Assumptions: In this project, we make the below assumption to simulate the data per industry standards.
- The data is residing in a MySQL database
Business requirement: The requirement is to bring the data from MySQL into the Hadoop data lake, refresh it every month, and perform an analysis of consumer expenditure behavior. This would help draw conclusions about the cost-of-living factor in New Zealand.
1. Top expense category of each year till 2021. This tells us whether the expenditure pattern is changing over time, and the percentage growth of expenses till 2021.
2. Average expense for each category in each year.
3. There is no end to this, so we will limit ourselves to 2 KPIs 😉
Data Flow Diagram: Refer to the above image.
Let’s Get our hands dirty then. 💻
Screenshots for all the steps are in my GitHub repo — mihirdhakan93/KitKat_Projects: Big Data Mini Projects with hands-on for Hadoop, HDFS, Spark, Scala, and much more (github.com)
Step 1: First, we need to fulfill our assumption that the data is residing in a MySQL database, so we will create the table and load the CSV file into MySQL.
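A minimal sketch of Step 1, built from the field list above. The database name, table name, column types, and file path are my own assumptions for illustration; adjust them to your setup. Note `OPTIONALLY ENCLOSED BY '"'`, which keeps commas inside quoted values from corrupting the load (Challenge 2 below).

```sql
-- Assumed names: database `kitkat`, table `electronic_transactions`.
CREATE DATABASE IF NOT EXISTS kitkat;
USE kitkat;

CREATE TABLE electronic_transactions (
  series_reference VARCHAR(13),
  period           DECIMAL(6,2),    -- YYYY.MM
  data_value       DECIMAL(15,2),
  suppressed       VARCHAR(1),
  status           VARCHAR(1),
  units            VARCHAR(20),
  magnitude        INT,
  subject          VARCHAR(100),
  `group`          VARCHAR(100),    -- GROUP is a reserved word, hence backticks
  series_title_1   VARCHAR(20),
  series_title_2   VARCHAR(100)
);

-- Load the CSV, skipping the header row.
LOAD DATA LOCAL INFILE '/path/to/electronic_transactions.csv'
INTO TABLE electronic_transactions
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;
```

You may need `local_infile=1` enabled on both the MySQL server and client for `LOAD DATA LOCAL INFILE` to work.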
Step 2: We need to create a similar structure in Hive. We keep the table EXTERNAL and point it at the HDFS path where we plan to land the file (imported from MySQL in Step 3).
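A sketch of the Hive DDL for Step 2, using the OpenCSV Serde hinted at in Challenge 2 so that quoted commas survive. The table name and HDFS location are assumptions. One caveat worth knowing: OpenCSVSerde exposes every column as STRING, so numeric columns must be cast inside queries.

```sql
-- Assumed HDFS landing path: /user/kitkat/electronic_transactions
CREATE EXTERNAL TABLE IF NOT EXISTS electronic_transactions (
  series_reference STRING,
  period           STRING,   -- YYYY.MM; cast to DOUBLE when querying
  data_value       STRING,   -- cast to DOUBLE when querying
  suppressed       STRING,
  status           STRING,
  units            STRING,
  magnitude        STRING,
  subject          STRING,
  `group`          STRING,
  series_title_1   STRING,
  series_title_2   STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "\"")
STORED AS TEXTFILE
LOCATION '/user/kitkat/electronic_transactions';
```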
Step 3: We use Sqoop to import the data from MySQL to HDFS.
Challenge 1: Every KitKat project will have small challenges. In this case we do not have a PK column defined in MySQL, and without this (or an integer column) Sqoop cannot import the data in parallel out of the box. Though there are ways to achieve this.
Challenge 2: Some values contain a comma (,) inside them, and as this is a CSV file, while loading into the MySQL and Hive tables we need to ensure the data values are not corrupted. Hint: OpenCSV Serde.
Step 4: Challenge accepted. We import the table by passing additional configuration in the Sqoop import command to allow a text splitter, and specify the column on which to split the data.
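Steps 3 and 4 together can be sketched as the Sqoop command below. The JDBC host, credentials, split column, and target directory are placeholders for illustration. The `allow_text_splitter` property is what lets Sqoop parallelize on a text column when no integer PK exists.

```shell
sqoop import \
  -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
  --connect jdbc:mysql://localhost:3306/kitkat \
  --username root -P \
  --table electronic_transactions \
  --split-by series_reference \
  --target-dir /user/kitkat/electronic_transactions \
  --fields-terminated-by ',' \
  --optionally-enclosed-by '"' \
  -m 4
```

Alternatively, passing `-m 1` avoids splitting altogether (a single mapper), at the cost of a slower, non-parallel import.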
Step 5: Create an SQL script to compute KPIs 1 and 2, and place it in an HQL file.
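One possible shape for the two KPI queries, assuming the Hive table sketched in Step 2 (remember OpenCSVSerde stores everything as STRING, so we cast). Truncating `period` (YYYY.MM) to its integer part gives the year.

```sql
-- KPI 1: top expense category per year, by total transacted amount.
WITH yearly AS (
  SELECT CAST(FLOOR(CAST(period AS DOUBLE)) AS INT) AS yr,
         series_title_2                             AS category,
         SUM(CAST(data_value AS DOUBLE))            AS total_spend
  FROM   electronic_transactions
  GROUP  BY CAST(FLOOR(CAST(period AS DOUBLE)) AS INT), series_title_2
)
SELECT yr, category, total_spend
FROM (
  SELECT yr, category, total_spend,
         RANK() OVER (PARTITION BY yr ORDER BY total_spend DESC) AS rnk
  FROM   yearly
) ranked
WHERE rnk = 1;

-- KPI 2: average expense for each category in each year.
SELECT CAST(FLOOR(CAST(period AS DOUBLE)) AS INT) AS yr,
       series_title_2                             AS category,
       AVG(CAST(data_value AS DOUBLE))            AS avg_spend
FROM   electronic_transactions
GROUP  BY CAST(FLOOR(CAST(period AS DOUBLE)) AS INT), series_title_2;
```

For KPI 1 you may also want to filter on `series_title_1` (Adjusted vs. Actual) so the same expense is not double-counted.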
Step 6: Execute the HQL file and export the results to a CSV file.
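A quick way to do Step 6 from the shell (the file names here are placeholders): the Hive CLI emits tab-separated rows, so a small `sed` converts them to CSV.

```shell
hive --hiveconf hive.cli.print.header=true -f kpi_queries.hql \
  | sed 's/\t/,/g' > kpi_results.csv
```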
Step 7: Interpret the behavior outcome of consumers based on data.
Step 8: Perform Steps 1 to 7 on your own (don’t just read and leave); it really helps.
Thanks for reading through. Click on Claps 👏 if you liked it!
See you soon in the next blog with some exciting new KPIs.