FIT5202 – Data processing for Big Data
Assignment 2B: Using real-time streaming data to predict
pedestrian traffic
Background
MelbourneGig is a start-up incubated at Monash University that provides services to
performers in the Music & Entertainment industry. The team would like to hire us as the
Analytics Engineer to analyse the pedestrian count open data from the City of Melbourne
using big data tools, develop machine learning models to predict pedestrian traffic, and
integrate the models into the streaming platform using Apache Kafka and Apache Spark
Streaming to perform prediction, in order to inform the street art performers.
In Part A of the assignment, we have already developed the machine learning models. In
this Part B, we need to create proof-of-concept streaming applications that
demonstrate the integration of the machine learning model, Kafka and Spark Streaming, and
create visualisations to recommend busy locations for street art performing.
Available files in Moodle:
– Three files
– Streaming_Pedestrian_December_counts_per_hour.csv
– Pedestrian_Counting_System_-_Sensor_Locations.csv
– count_estimation_pipeline_model.zip
– A Metadata file is included which contains the information about the dataset.
– These files are available in Moodle under Assessment 2B data folder
Information on Files
Two CSV data files are provided: the latest December hourly count data and the
corresponding sensor locations. The streaming data file is an extract of the latest
pedestrian hourly count data published by the City of Melbourne for the month of
December 2020. The original data is available on the website
https://data.melbourne.vic.gov.au/.
The provided model “count_estimation_pipeline_model” is a simplified pipeline that predicts
the hourly count given the sensor ID, week of the year, day of the month, day of the
week, time, and the previous day’s hourly count at the same hour.
– The provided model requires the following columns in the designated formats –
“Sensor_ID”: int, “next_day_week”: int, “next_Mdate”: int, “next_day_of_week”: int,
“Time”: int, “prev_count”: int. Make sure you include these columns before prediction.
– To use the model, please unzip the zip file and the resulting folder
“count_estimation_pipeline_model” should contain two subfolders – “metadata” and
“stages”.
– You can put the “count_estimation_pipeline_model” folder in the same directory as
your notebook before loading the model into Spark (a minimal loading sketch is shown below).
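As a quick sanity check that the unzipped model loads correctly, a minimal sketch could look like the following; it assumes a SparkSession named spark has already been created.

# A minimal sketch for loading the provided pipeline model.
# Assumes the unzipped "count_estimation_pipeline_model" folder sits next to the notebook
# and that a SparkSession named `spark` already exists.
from pyspark.ml import PipelineModel

pipeline_model = PipelineModel.load("count_estimation_pipeline_model")
print(pipeline_model.stages)   # should list the fitted stages stored in the "stages" subfolder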
What you need to achieve
The MelbourneGig company requires a proof-of-concept application for ingesting the new
count data and predicting the pedestrian counts. To achieve this, you need to simulate
the streaming data production using Kafka, and then build a streaming application that
ingests the data and integrates the machine learning model (provided to you) to predict
whether a location would have more than 2000 visitors in an hour, so that the result can
be used for recommending busy locations for street art performing.
A compulsory interview would also be arranged in Week 6 after the submission to
discuss your proof-of-concept application.
Architecture
The overall architecture of the assignment setup is represented by the following figure.
Fig 1: Overall architecture for assignment 2 (part B components updated)
In Part B of the assignment 2, you have three main tasks – producing streaming data,
processing the streaming data, and visualising the data.
1. In Task 1, for producing the streaming counts for December 2020, you can use the csv
module, the Pandas library, or other libraries to read and publish the data to the Kafka
stream.
2. In Task 2, for the streaming data application, you need to use Spark Structured Streaming
together with PySpark ML / DataFrame to process the data streams.
3. In Task 3, you can use the csv module, the Pandas library, or other libraries to read the
data from the Kafka stream and visualise it.
Please follow the steps to document the processes and write the code in Jupyter notebooks.
Getting Started
● Download the data and models from Moodle.
● Create an Assignment-2B-Task1_producer.ipynb file for data production
● Create an Assignment-2B-Task2_spark_streaming.ipynb file for consuming and
processing data using Spark Structured Streaming
● Create an Assignment-2B-Task3_consumer.ipynb file for consuming the count
data using Kafka
1. Producing the data (10%)
In this task, we will implement an Apache Kafka producer to simulate real-time data
transfer from one repository to another.
Important:
– Do not use Spark in this task
Your program should send one batch containing one day’s worth of records for all sensors
to the Kafka stream every 5 seconds.
– For example, for the first batch of data transmission, your program should send all
sensors’ data captured on 2020-12-01 to the Kafka stream; after 5 seconds, your
program should send all sensors’ data captured on 2020-12-02 to the Kafka stream,
and so on.
All the data should be sent in the original String format, without converting to any datetime format.
Save your code in Assignment-2B-Task1_producer.ipynb.
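A minimal producer sketch using kafka-python is shown below. The topic name, broker address, JSON serialisation and the “Date_Time” column used to group records by day are assumptions rather than requirements, so adjust them to your own design and to the metadata file.

import csv
from json import dumps
from time import sleep
from kafka import KafkaProducer

topic = "pedestrian-count"   # placeholder topic name
producer = KafkaProducer(
    bootstrap_servers=["127.0.0.1:9092"],
    value_serializer=lambda x: dumps(x).encode("utf-8"))

# Read the December counts; every field is kept as a String, as required.
with open("Streaming_Pedestrian_December_counts_per_hour.csv") as f:
    rows = list(csv.DictReader(f))

# Group the rows by calendar date (assumed to be the leading part of "Date_Time").
batches = {}
for row in rows:
    date_key = row["Date_Time"].split(" ")[0]
    batches.setdefault(date_key, []).append(row)

# Send one day's worth of records for all sensors every 5 seconds.
for date_key in sorted(batches):
    for row in batches[date_key]:
        producer.send(topic, value=row)
    producer.flush()
    print(f"Sent {len(batches[date_key])} records for {date_key}")
    sleep(5)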
2. Streaming application using Spark Structured Streaming (55%)
In this task, we will implement Spark Structured Streaming to consume the data from task 1
and perform predictive analytics.
Important:
– In this task, use PySpark Structured Streaming together with the PySpark
DataFrame APIs and PySpark ML.
– You are also provided with a pre-trained pipeline model for predicting the
hourly counts between 9am-11:59pm on the next day; the prediction needs to be
persisted. Information on the required inputs of the pipeline model can be
found in the Background section.
1. Write code to create a SparkSession using a SparkConf object, which uses two local
cores, has a proper application name, and uses UTC as the timezone (see the sketch below).
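A minimal sketch of step 1; the application name below is just a placeholder.

from pyspark import SparkConf
from pyspark.sql import SparkSession

spark_conf = (SparkConf()
              .setMaster("local[2]")                      # two local cores
              .setAppName("FIT5202 Assignment 2B")        # placeholder application name
              .set("spark.sql.session.timeZone", "UTC"))  # use UTC as the timezone

spark = SparkSession.builder.config(conf=spark_conf).getOrCreate()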
2. Similar to assignment 2A, write code to define the data schema for the sensor
location CSV file, following the data types suggested in the metadata file, with the
exception of the “location” column; then load the CSV file into a dataframe using the
schema (see the sketch below).
a. Use StringType for the “location” column
b. Note: in this assignment, “installation_date” should be read directly as Date format,
instead of as String as in assignment 1.
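A sketch of step 2; the column names and types below are assumptions based on the public sensor-location dataset, so verify them against the provided metadata file before use.

from pyspark.sql.types import (StructType, StructField, IntegerType,
                               StringType, DateType, DoubleType)

sensor_schema = StructType([
    StructField("sensor_id", IntegerType(), True),
    StructField("sensor_description", StringType(), True),
    StructField("sensor_name", StringType(), True),
    StructField("installation_date", DateType(), True),   # read directly as Date
    StructField("status", StringType(), True),
    StructField("note", StringType(), True),
    StructField("direction_1", StringType(), True),
    StructField("direction_2", StringType(), True),
    StructField("latitude", DoubleType(), True),
    StructField("longitude", DoubleType(), True),
    StructField("location", StringType(), True),           # StringType as required
])

sensor_df = (spark.read.format("csv")
             .option("header", True)
             .schema(sensor_schema)
             .load("Pedestrian_Counting_System_-_Sensor_Locations.csv"))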
3. Using the same topic name as the Kafka producer in Task 1, ingest the streaming
data into Spark Structured Streaming, assuming all data arrive in String format (see the sketch below).
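A sketch of step 3, assuming the Task 1 producer publishes to a topic named "pedestrian-count" on a local broker; both names are placeholders.

raw_stream_df = (spark.readStream
                 .format("kafka")
                 .option("kafka.bootstrap.servers", "127.0.0.1:9092")
                 .option("subscribe", "pedestrian-count")
                 .load())

# Kafka delivers key/value as binary; keep the payload as a String for now.
count_str_df = raw_stream_df.selectExpr("CAST(value AS STRING) AS value")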
4. Persist the raw streaming data in parquet format
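A sketch of step 4, continuing from the previous sketch; the output and checkpoint paths are placeholders.

raw_sink = (count_str_df.writeStream
            .format("parquet")
            .outputMode("append")
            .option("path", "parquet/raw_counts")
            .option("checkpointLocation", "parquet/raw_counts_checkpoint")
            .start())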
5. The streaming data should then be transformed into the proper formats following the
metadata file schema, similar to assignment 2A (see the sketch below).
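A sketch of step 5, continuing from the previous sketches. It assumes the producer serialises each row as a JSON document and that the count data uses columns such as ID, Date_Time, Mdate, Day, Time, Sensor_ID and Hourly_Count; the column names, types and the timestamp pattern are assumptions and must be checked against the metadata file.

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

count_str_schema = StructType([
    StructField("ID", StringType(), True),
    StructField("Date_Time", StringType(), True),
    StructField("Mdate", StringType(), True),
    StructField("Day", StringType(), True),
    StructField("Time", StringType(), True),
    StructField("Sensor_ID", StringType(), True),
    StructField("Hourly_Count", StringType(), True),
])

count_df = (count_str_df
            .select(F.from_json("value", count_str_schema).alias("row"))
            .select("row.*")
            # cast to the types suggested in the metadata file
            .withColumn("ID", F.col("ID").cast("int"))
            .withColumn("Date_Time", F.to_timestamp("Date_Time", "MM/dd/yyyy hh:mm:ss a"))
            .withColumn("Mdate", F.col("Mdate").cast("int"))
            .withColumn("Time", F.col("Time").cast("int"))
            .withColumn("Sensor_ID", F.col("Sensor_ID").cast("int"))
            .withColumn("Hourly_Count", F.col("Hourly_Count").cast("int")))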
6. As the purpose of the recommendation is to predict the next day’s pedestrian count
between 9am-11:59pm, write code to perform the following transformations to
prepare the columns for model prediction (a combined sketch follows the list).
a. Create a date format column named “next_date” which represents the next
calendar date of “Date_Time”
b. Create the column named “next_Mdate” based on the column “next_date” to
include day of the month information
c. Create the column named “next_day_week” based on the column “next_date”
to include week of the year information
d. Create the column named “next_day_of_week” based on the column
“next_date” to include day of the week information, assuming Monday being
the first day of week
e. Rename the column “Hourly_Count” as “prev_count”
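A combined sketch of transformations (a)-(e), continuing from the previous sketches. Spark's dayofweek() counts Sunday as 1, so it is shifted below to make Monday the first day of the week; whether the model expects this exact 1-7 encoding is an assumption that should be verified against Part A.

from pyspark.sql import functions as F

prepared_df = (count_df
               .withColumn("next_date", F.date_add(F.col("Date_Time").cast("date"), 1))  # (a)
               .withColumn("next_Mdate", F.dayofmonth("next_date"))                      # (b)
               .withColumn("next_day_week", F.weekofyear("next_date"))                   # (c)
               .withColumn("next_day_of_week",
                           ((F.dayofweek("next_date") + 5) % 7) + 1)                     # (d) Monday = 1
               .withColumnRenamed("Hourly_Count", "prev_count"))                         # (e)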
7. Load the given machine learning model, and use the model to predict the next day’s
pedestrian count between 9am-11:59pm. Persist the prediction result in parquet
format (see the sketch below).
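A sketch of step 7, continuing from the previous sketches; the Time filter and the output paths are assumptions.

from pyspark.sql import functions as F
from pyspark.ml import PipelineModel

pipeline_model = PipelineModel.load("count_estimation_pipeline_model")

# Keep only the hours 9am-11:59pm, i.e. Time >= 9, before predicting.
prediction_df = pipeline_model.transform(prepared_df.filter(F.col("Time") >= 9))

prediction_sink = (prediction_df.writeStream
                   .format("parquet")
                   .outputMode("append")
                   .option("path", "parquet/predictions")
                   .option("checkpointLocation", "parquet/predictions_checkpoint")
                   .start())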
8. Using the prediction result, write code to process the data following the
requirements below (a sketch covering both requirements follows the list).
a. For each sensor, get the number of hours that the predicted pedestrian count
would exceed 2000 on each day (for the hours between 9am-11:59pm). Show
them inside the notebook file.
i. Hint – you might want to consider the window size for each day .
b. If any sensor’s hourly count between 9am-11:59pm on the next day is
predicted to exceed 2000, combine the result with sensor longitude and
latitude information, and write the stream back to Kafka sink using a different
topic name
i. Hint – you might want to construct key and value columns before
writing the stream back to Kafka.
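A sketch covering requirements 8a and 8b, continuing from the previous sketches; the prediction column name ("prediction"), the join column names, the output topic name and the broker address are all assumptions.

from pyspark.sql import functions as F

# 8a: for each sensor and each day, count the hours predicted to exceed 2000.
busy_hours_df = (prediction_df
                 .filter(F.col("prediction") > 2000)
                 .groupBy(F.window(F.col("next_date").cast("timestamp"), "1 day"),
                          F.col("Sensor_ID"))
                 .count()
                 .withColumnRenamed("count", "busy_hours"))

console_query = (busy_hours_df.writeStream
                 .outputMode("complete")
                 .format("console")
                 .option("truncate", False)
                 .start())

# 8b: join busy predictions with the static sensor locations and write back to Kafka.
busy_locations_df = (prediction_df
                     .filter(F.col("prediction") > 2000)
                     .join(sensor_df, prediction_df.Sensor_ID == sensor_df.sensor_id)
                     .drop(sensor_df.sensor_id)
                     .select(F.col("Sensor_ID").cast("string").alias("key"),
                             F.to_json(F.struct("Sensor_ID", "next_date", "Time", "prediction",
                                                "latitude", "longitude")).alias("value")))

kafka_query = (busy_locations_df.writeStream
               .format("kafka")
               .option("kafka.bootstrap.servers", "127.0.0.1:9092")
               .option("topic", "busy-locations")
               .option("checkpointLocation", "kafka_sink_checkpoint")
               .start())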
Save your code in Assignment-2B-Task2_spark_streaming.ipynb.
3. Consuming data using Kafka (15%)
In this task, we will implement an Apache Kafka consumer to consume the data from task
2.8b.
Important:
– In this task, use a Kafka consumer to consume the streaming data published
from task 2.8b.
– Do not use Spark in this task
Your program should consume the latest data and display the sensor locations on a map.
The map should be refreshed whenever a new batch of data is being consumed.
– Hint – you can use libraries like plotly or folium to show data on a map; please also
provide instructions on how to install your plotting library.
Save your code in Assignment-2B-Task3_consumer.ipynb .
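A minimal consumer sketch using kafka-python and folium (install with: pip install kafka-python folium). The topic name and message fields follow the assumptions made in the Task 2 sketches and must match your own implementation; a real solution would also accumulate each consumed batch before refreshing the map.

import json
import folium
from kafka import KafkaConsumer
from IPython.display import display, clear_output

consumer = KafkaConsumer(
    "busy-locations",                             # placeholder topic name
    bootstrap_servers=["127.0.0.1:9092"],
    auto_offset_reset="latest",
    value_deserializer=lambda x: json.loads(x.decode("utf-8")))

for message in consumer:
    record = message.value
    clear_output(wait=True)
    m = folium.Map(location=[-37.8136, 144.9631], zoom_start=14)   # Melbourne CBD
    folium.Marker(location=[float(record["latitude"]), float(record["longitude"])],
                  popup=f"Sensor {record['Sensor_ID']}").add_to(m)
    display(m)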
Assignment Marking
The marking of this assignment is based on the quality of the work you have submitted rather
than just the quantity. The marking starts from zero and goes up based on the tasks you have
successfully completed and their quality, for example how well the submitted code follows
programming standards, code documentation, presentation of the assignment, readability of
the code, reusability of the code, organisation of code, and so on. Please find the PEP 8 –
Style Guide for Python Code here for your reference.
An interview is also required to demonstrate your knowledge and understanding of the
assignment. The interview will be run during the week 12 lab, where an audio and camera
connection is required.
Submission
You should submit the final version of your assignment solution online via Moodle. You must
submit the following:
● A PDF file containing all code from the 1st, 2nd and 3rd notebooks, to be submitted
through the Turnitin submission link
○ Use the browser’s print function (NOT Microsoft print) to save each notebook
as PDF
● A zip file named after your authcate name (e.g. psan002), which should
contain:
○ Assignment-2B-Task1_producer.ipynb
○ Assignment-2B-Task2_spark_streaming.ipynb
○ Assignment-2B-Task3_consumer.ipynb
This should be a ZIP file and not any other kind of compressed folder (e.g.
.rar, .7zip, .tar). Please do not include the data files in the ZIP file.
