Course: Big Data - IU S25
Author: Firas Jolha
This assignment provides hands-on experience in using Spark to analyze a real-world dataset and extract meaningful business insights.
Web server logs contain information on any event that was registered/logged. They carry a lot of insight about website visitors and their behavior, crawlers accessing the site, business metrics, security issues, and more. This midterm uses two clusters (MongoDB and Hadoop). The Hadoop cluster will be used to transform the dataset and load it into MongoDB using PySpark. In the MongoDB cluster, you will analyze the data and store the results of the data analytics. The data in MongoDB should be stored in a sharded collection.
The dataset can be downloaded from the link attached to this document. It contains the access logs of an Nginx server, where each line stores the log info of a single request to the server. The typical log format in Nginx servers is as follows:
'$remote_addr - $remote_user [$time_local] ' '"$request" $status $body_bytes_sent ' '"$http_referer" "$http_user_agent"';
You can see below the first 15 lines of this data file. You need to explore the dataset to understand its schema.
54.36.149.41 - - [22/Jan/2019:03:56:14 +0330] "GET /filter/27|13%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,27|%DA%A9%D9%85%D8%AA%D8%B1%20%D8%A7%D8%B2%205%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,p53 HTTP/1.1" 200 30577 "-" "Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)" "-"
31.56.96.51 - - [22/Jan/2019:03:56:16 +0330] "GET /image/60844/productModel/200x200 HTTP/1.1" 200 5667 "https://www.zanbil.ir/m/filter/b113" "Mozilla/5.0 (Linux; Android 6.0; ALE-L21 Build/HuaweiALE-L21) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.158 Mobile Safari/537.36" "-"
31.56.96.51 - - [22/Jan/2019:03:56:16 +0330] "GET /image/61474/productModel/200x200 HTTP/1.1" 200 5379 "https://www.zanbil.ir/m/filter/b113" "Mozilla/5.0 (Linux; Android 6.0; ALE-L21 Build/HuaweiALE-L21) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.158 Mobile Safari/537.36" "-"
40.77.167.129 - - [22/Jan/2019:03:56:17 +0330] "GET /image/14925/productModel/100x100 HTTP/1.1" 200 1696 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
91.99.72.15 - - [22/Jan/2019:03:56:17 +0330] "GET /product/31893/62100/%D8%B3%D8%B4%D9%88%D8%A7%D8%B1-%D8%AE%D8%A7%D9%86%DA%AF%DB%8C-%D9%BE%D8%B1%D9%86%D8%B3%D9%84%DB%8C-%D9%85%D8%AF%D9%84-PR257AT HTTP/1.1" 200 41483 "-" "Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0)Gecko/16.0 Firefox/16.0" "-"
40.77.167.129 - - [22/Jan/2019:03:56:17 +0330] "GET /image/23488/productModel/150x150 HTTP/1.1" 200 2654 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
40.77.167.129 - - [22/Jan/2019:03:56:18 +0330] "GET /image/45437/productModel/150x150 HTTP/1.1" 200 3688 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
40.77.167.129 - - [22/Jan/2019:03:56:18 +0330] "GET /image/576/article/100x100 HTTP/1.1" 200 14776 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
66.249.66.194 - - [22/Jan/2019:03:56:18 +0330] "GET /filter/b41,b665,c150%7C%D8%A8%D8%AE%D8%A7%D8%B1%D9%BE%D8%B2,p56 HTTP/1.1" 200 34277 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
40.77.167.129 - - [22/Jan/2019:03:56:18 +0330] "GET /image/57710/productModel/100x100 HTTP/1.1" 200 1695 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
207.46.13.136 - - [22/Jan/2019:03:56:18 +0330] "GET /product/10214 HTTP/1.1" 200 39677 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
40.77.167.129 - - [22/Jan/2019:03:56:19 +0330] "GET /image/578/article/100x100 HTTP/1.1" 200 9831 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
178.253.33.51 - - [22/Jan/2019:03:56:19 +0330] "GET /m/product/32574/62991/%D9%85%D8%A7%D8%B4%DB%8C%D9%86-%D8%A7%D8%B5%D9%84%D8%A7%D8%AD-%D8%B5%D9%88%D8%B1%D8%AA-%D9%BE%D8%B1%D9%86%D8%B3%D9%84%DB%8C-%D9%85%D8%AF%D9%84-PR465AT HTTP/1.1" 200 20406 "https://www.zanbil.ir/m/filter/p5767%2Ct156?name=%D9%85%D8%A7%D8%B4%DB%8C%D9%86-%D8%A7%D8%B5%D9%84%D8%A7%D8%AD&productType=electric-shavers" "Mozilla/5.0 (Linux; Android 5.1; HTC Desire 728 dual sim) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.83 Mobile Safari/537.36" "-"
40.77.167.129 - - [22/Jan/2019:03:56:19 +0330] "GET /image/6229/productModel/100x100 HTTP/1.1" 200 1796 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
91.99.72.15 - - [22/Jan/2019:03:56:19 +0330] "GET /product/10075/13903/%D9%85%D8%A7%DB%8C%DA%A9%D8%B1%D9%88%D9%81%D8%B1-%D8%B1%D9%88%D9%85%DB%8C%D8%B2%DB%8C-%D8%B3%D8%A7%D9%85%D8%B3%D9%88%D9%86%DA%AF-%D9%85%D8%AF%D9%84-CE288 HTTP/1.1" 200 41725 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.92 Safari/537.36" "-"
You need to create and run your PySpark app on the university cluster. If you need access, please contact the instructor (@FirasJolha) on Telegram.
For the sections spark-rdd, spark-dataframe and spark-sql, you need to prepare a Jupyter notebook app.ipynb using the kernel Pyspark 3.11 on the cluster and write and execute all queries in this notebook.
Important notes:
For the Spark MLlib section, you need to write your PySpark application in an app.py file and submit it using spark-submit. You need to create a script file run.sh that prepares the assignment workspace and installs all required dependencies. After that, the same script run.sh will run the application using the YARN resource manager as master and client deploy mode. Redirect the stdout and stderr (&>) of the spark-submit command to a file output.txt. This file will contain the output of the app. You can set the log level to ERROR in the application if you prefer.
Submit to Moodle four files: app.ipynb, app.py, run.sh, and output.txt. Do not zip these files when you submit.
You are allowed to use any support tools, but it is not recommended to use GenAI tools for generating text or code. You can use these generative tools to proofread your report or check the syntax of your code. All solutions must be submitted individually. We will perform a plagiarism check on the code and report, and you will be penalized if your submission is found to be plagiarized.
NOTE:
Proof of implementation
A block like this one outlines the results of the task that must be presented as evidence for grading purposes. These requirements must be followed as part of the tasks.
Note: In this list of exercises, you have to use the pyspark.rdd API. You need to perform the analysis using Spark RDD API.
Upload the dataset to HDFS under /data/<your-name>. List the content of the folder /data/<your-name> in HDFS using the hdfs dfs command.
Proof of implementation
- Show the content of the folder /data/<your-name> in HDFS.
Create a Python virtual environment venv and add any packages you need. Read the log file into an RDD and extract the fields (remote_addr, time_local, request, url, status, body_bytes_sent) per request. For example, the first line in the file should be transformed into:
{'remote_addr': '54.36.149.41', # str
'time_local': { # dict
'year': 2019, # int
'month': 1, # int
'day': 22, # int
'hours': 3, # int
'minutes': 56, # int
'seconds': 14 # int
},
'request': 'GET', # str
'url': '/filter/27|13%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,27|%DA%A9%D9%85%D8%AA%D8%B1%20%D8%A7%D8%B2%205%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,p53', # str
'status': 200, # int
'body_bytes_sent': 30577 # int
}
Here I dropped some of the metadata from the request line. We assume that we need to keep only the fields mentioned in the example (remote_addr, time_local, request, url, status, body_bytes_sent).
Note: It is not allowed to change the format/schema or the data types of the fields; they must be stored exactly with the schema and data types given above.
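For illustration, a minimal RDD parsing sketch is given below. The regular expression, the helper name parse_line, the variable names (spark, sc, logs_rdd), and the HDFS path and file name are assumptions, not requirements; adapt them to your own notebook.

import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nginx-logs").getOrCreate()
sc = spark.sparkContext

# Pattern covering only the fields we keep from each Nginx access-log line
LOG_PATTERN = re.compile(
    r'(?P<remote_addr>\S+) - \S+ \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) (?P<body_bytes_sent>\d+)'
)

MONTHS = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
          'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}

def parse_line(line):
    m = LOG_PATTERN.match(line)
    if m is None:
        return None                      # skip malformed lines
    # time_local looks like "22/Jan/2019:03:56:14 +0330"
    day, mon, rest = m.group('time_local').split('/', 2)
    year, hours, minutes, seconds = rest.split(' ')[0].split(':')
    return {
        'remote_addr': m.group('remote_addr'),
        'time_local': {
            'year': int(year), 'month': MONTHS[mon], 'day': int(day),
            'hours': int(hours), 'minutes': int(minutes), 'seconds': int(seconds),
        },
        'request': m.group('request'),
        'url': m.group('url'),
        'status': int(m.group('status')),
        'body_bytes_sent': int(m.group('body_bytes_sent')),
    }

# Hypothetical HDFS path; point it at your own /data/<your-name> folder
logs_rdd = (sc.textFile("hdfs:///data/<your-name>/access.log")
              .map(parse_line)
              .filter(lambda rec: rec is not None))

print(logs_rdd.count())        # number of items in the RDD
for rec in logs_rdd.take(10):  # first 10 items
    print(rec)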
Proof of implementation
- Show the number of items in the RDD.
- Show first 10 items of the RDD.
Proof of implementation
- Show the first 10 results from the RDD sorted by most recent years.
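Assuming this exercise asks for the parsed records ordered by year, most recent first, one possible sketch reusing logs_rdd from the parsing example above is:

# Hypothetical sketch: order the parsed records by year, most recent first
recent_first = logs_rdd.sortBy(lambda rec: rec['time_local']['year'], ascending=False)
for rec in recent_first.take(10):
    print(rec)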
Proof of implementation
- Show the first 10 results from the RDD sorted by top 10 most successful GET requests…
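One possible reading of this requirement is the 10 URLs with the most GET requests that returned status 200; a hypothetical sketch, again reusing logs_rdd:

# Hypothetical sketch: most frequent successful (status 200) GET requests per URL
top_get = (logs_rdd
           .filter(lambda rec: rec['request'] == 'GET' and rec['status'] == 200)
           .map(lambda rec: (rec['url'], 1))
           .reduceByKey(lambda a, b: a + b)
           .sortBy(lambda kv: kv[1], ascending=False))
print(top_get.take(10))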
Note: In this list of exercises, you have to use the pyspark.sql.DataFrame API. You need to perform the analysis using Spark DataFrame API.
{'remote_addr': '54.36.149.41', # StringType
'time_local': { # StructType
'year': 2019, # IntegerType
'month': 1, # IntegerType
'day': 22, # IntegerType
'hours': 3, # IntegerType
'minutes': 56, # IntegerType
'seconds': 14 # IntegerType
},
'request': 'GET', # StringType
'url': '/filter/27|13%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,27|%DA%A9%D9%85%D8%AA%D8%B1%20%D8%A7%D8%B2%205%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,p53', # StringType
'status': 200, # IntegerType
'body_bytes_sent': 30577 # IntegerType
}
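A sketch of an explicit schema matching the types above is shown below, together with one possible way to build the dataframe from the parsed RDD; the names log_schema, logs_rdd, and logs_df are assumptions carried over from the earlier sketches.

from pyspark.sql import types as T

# Explicit schema matching the required field types
log_schema = T.StructType([
    T.StructField("remote_addr", T.StringType()),
    T.StructField("time_local", T.StructType([
        T.StructField("year", T.IntegerType()),
        T.StructField("month", T.IntegerType()),
        T.StructField("day", T.IntegerType()),
        T.StructField("hours", T.IntegerType()),
        T.StructField("minutes", T.IntegerType()),
        T.StructField("seconds", T.IntegerType()),
    ])),
    T.StructField("request", T.StringType()),
    T.StructField("url", T.StringType()),
    T.StructField("status", T.IntegerType()),
    T.StructField("body_bytes_sent", T.IntegerType()),
])

# One possible way to build the dataframe: reuse the RDD of dicts
# produced by the parsing sketch in the RDD section
logs_df = spark.createDataFrame(logs_rdd, schema=log_schema)

print((logs_df.count(), len(logs_df.columns)))   # size: rows and columns
logs_df.printSchema()                            # schema after reading
logs_df.show(10, truncate=False)                 # first 10 samples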
Proof of implementation
- Show the size of the dataframe (rows and columns).
- Show the schema of the dataframes after reading.
- Show first 10 samples of the dataframes.
Proof of implementation
- Show the number of records in the dataframe.
- Show first 10 samples of the dataframes.
Proof of implementation
- Show the number of records in the dataframe.
- Show first 10 samples of the dataframes.
Proof of implementation
- Show first 10 samples of the dataframe.
- Show the number of records per category in the dataframe.
Calculate the total of body_bytes_sent by each remote_addr.
Proof of implementation
- Show first 10 samples of the dataframe sorted by maximum total of bytes of log body.
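A possible DataFrame sketch for this aggregation (the output column name is an assumption):

from pyspark.sql import functions as F

# Total body_bytes_sent per remote_addr, largest totals first
bytes_per_addr = (logs_df
                  .groupBy("remote_addr")
                  .agg(F.sum("body_bytes_sent").alias("total_body_bytes_sent"))
                  .orderBy(F.desc("total_body_bytes_sent")))
bytes_per_addr.show(10, truncate=False)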
Proof of implementation
- Show first 10 values for both variables (total number of successful requests, average body size) before calculating the correlation coefficient.
- Show the number of the values for which the correlation will be calculated.
- Show the final correlation coefficient.
- Show the result of statistical significance of the correlation coefficient.
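A hypothetical sketch of this analysis, assuming the two variables are computed per remote_addr and that scipy is available on the cluster for the significance test:

from pyspark.sql import functions as F

# Per remote_addr: number of successful requests and average body size
per_addr = (logs_df
            .filter(F.col("status") == 200)
            .groupBy("remote_addr")
            .agg(F.count("*").alias("successful_requests"),
                 F.avg("body_bytes_sent").alias("avg_body_size")))

per_addr.show(10)        # first 10 values of both variables
print(per_addr.count())  # number of values used for the correlation

# Pearson correlation coefficient between the two variables
print(per_addr.corr("successful_requests", "avg_body_size"))

# Statistical significance: collect the (already aggregated) table and use scipy
import scipy.stats as st
pdf = per_addr.toPandas()
r, p_value = st.pearsonr(pdf["successful_requests"], pdf["avg_body_size"])
print(r, p_value)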
Write a UDF that extracts the first part of the url. For example, the url /filter/27|13%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,27|%DA%A9%D9%85%D8%AA%D8%B1%20%D8%A7%D8%B2%205%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,p53 has the first part as /filter. Your UDF should extract this part for all log lines.
Proof of implementation
- Show the results for the first 10 records after applying the UDF on the dataframe. The output should display the column both before and after applying the UDF.
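A minimal UDF sketch; the function name first_path_part and the splitting rule are assumptions.

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Keep only the first path segment of the url, e.g. '/filter/27|13...' -> '/filter'
@F.udf(returnType=StringType())
def first_path_part(url):
    if url is None:
        return None
    parts = url.split('/')
    return '/' + parts[1] if len(parts) > 1 else url

# Show the column before and after applying the UDF
logs_df.select("url", first_path_part("url").alias("url_first_part")).show(10, truncate=False)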
Note: In this list of exercises, you have to use the spark.sql function. You need to perform the analysis using Spark SQL queries.
Register the dataframe as a temporary view in the catalog (nginx_logs).
Proof of implementation
- Show all tables in the catalog.
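A minimal sketch of registering the view and listing the catalog, assuming logs_df from the DataFrame section:

# Register the dataframe as a temporary view named nginx_logs
logs_df.createOrReplaceTempView("nginx_logs")

# Show all tables/views known to the catalog
spark.sql("SHOW TABLES").show(truncate=False)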
Query the nginx_logs view.
Proof of implementation
- Show the first 10 records of this view.
Write a query that finds the GET requests for resources under the /product/ path.
Proof of implementation
- Show the first 10 results of this query.
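A hypothetical Spark SQL sketch for this query; the selected columns are an assumption.

spark.sql("""
    SELECT remote_addr, url, status, body_bytes_sent
    FROM nginx_logs
    WHERE request = 'GET' AND url LIKE '/product/%'
""").show(10, truncate=False)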
Proof of implementation
- Show the top 10 results of this query.
Write a query that finds the resources (request path) that resulted in a 404 error.
Proof of implementation
- Show the top 10 results of this query.
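A hypothetical Spark SQL sketch, assuming the exercise asks for the most frequent request paths with a 404 status:

spark.sql("""
    SELECT url, COUNT(*) AS cnt
    FROM nginx_logs
    WHERE status = 404
    GROUP BY url
    ORDER BY cnt DESC
""").show(10, truncate=False)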
Note: In this list of exercises, you have to use the pyspark.ml and/or pyspark.mllib modules.
1. Use transformers (pyspark.ml.Transformer) to encode all the following features in the dataframe:
- remote address (categorical feature)
- request method (categorical feature)
- request uri (text feature)
- status (categorical feature)
- body_bytes_sent (numerical feature)
For categorical features you can use one-hot encoding or any other appropriate encoding method. For text features, you can use word2vec or any other appropriate method, but not one-hot encoding. For numerical features, you need to scale them using an appropriate scaler. Add any other transformers needed here. Fit and transform the dataframe. (A sketch covering steps 1-7 is shown after this list.)
2. Create a new column label to indicate whether the request had a 200 status code. Encode this column using an appropriate transformer.
3. Build two different classifiers to classify the successfulness of requests. The classifiers should be based on different classification methods, such as Decision Tree and Support Vector Machine.
4. Split the dataset and use the training set for training and test set for reporting the performance of the models.
5. Train the models.
6. Evaluate the performance of both classifiers using accuracy and f1 metrics. Use the test data. Which model is better in terms of f1?
7. Using the best model, predict the success of a specific request from the test set.
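A minimal end-to-end sketch of steps 1-7 is given below. The choice of encoders (StringIndexer + OneHotEncoder, RegexTokenizer + Word2Vec, MinMaxScaler), the 80/20 split, the Decision Tree / Linear SVC pair, and all column and variable names are illustrative assumptions; adapt them to your own solution.

from pyspark.sql import functions as F
from pyspark.ml import Pipeline
from pyspark.ml.feature import (StringIndexer, OneHotEncoder, RegexTokenizer,
                                Word2Vec, VectorAssembler, MinMaxScaler)
from pyspark.ml.classification import DecisionTreeClassifier, LinearSVC
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# 2. Label: 1.0 if the request returned status 200, else 0.0
data = logs_df.withColumn("label", F.when(F.col("status") == 200, 1.0).otherwise(0.0))

# 1. Categorical features: index then one-hot encode
addr_idx = StringIndexer(inputCol="remote_addr", outputCol="remote_addr_idx", handleInvalid="keep")
addr_ohe = OneHotEncoder(inputCol="remote_addr_idx", outputCol="remote_addr_vec")
req_idx = StringIndexer(inputCol="request", outputCol="request_idx", handleInvalid="keep")
req_ohe = OneHotEncoder(inputCol="request_idx", outputCol="request_vec")
# note: status is listed among the features in the task even though the label is derived from it
status_idx = StringIndexer(inputCol="status", outputCol="status_idx", handleInvalid="keep")
status_ohe = OneHotEncoder(inputCol="status_idx", outputCol="status_vec")

# Text feature: tokenize the url and embed it with Word2Vec
url_tok = RegexTokenizer(inputCol="url", outputCol="url_tokens", pattern="[/,%|.-]+")
url_w2v = Word2Vec(inputCol="url_tokens", outputCol="url_vec", vectorSize=32, minCount=1)

# Numerical feature: scale body_bytes_sent
bytes_vec = VectorAssembler(inputCols=["body_bytes_sent"], outputCol="body_bytes_vec")
bytes_scaled = MinMaxScaler(inputCol="body_bytes_vec", outputCol="body_bytes_scaled")

features = VectorAssembler(
    inputCols=["remote_addr_vec", "request_vec", "status_vec", "url_vec", "body_bytes_scaled"],
    outputCol="features")

prep = Pipeline(stages=[addr_idx, addr_ohe, req_idx, req_ohe, status_idx, status_ohe,
                        url_tok, url_w2v, bytes_vec, bytes_scaled, features])
encoded = prep.fit(data).transform(data)

# 4. Split into train and test sets
train, test = encoded.randomSplit([0.8, 0.2], seed=42)
print(train.count(), test.count())

# 5. Train two classifiers based on different methods
dt = DecisionTreeClassifier(featuresCol="features", labelCol="label").fit(train)
svc = LinearSVC(featuresCol="features", labelCol="label").fit(train)

# 6. Evaluate accuracy and f1 on the test set
acc_eval = MulticlassClassificationEvaluator(labelCol="label", metricName="accuracy")
f1_eval = MulticlassClassificationEvaluator(labelCol="label", metricName="f1")
for name, model in [("DecisionTree", dt), ("LinearSVC", svc)]:
    preds = model.transform(test)
    print(name, "accuracy =", acc_eval.evaluate(preds), "f1 =", f1_eval.evaluate(preds))

# 7. Predict the success of one specific request from the test set
dt.transform(test.limit(1)).select("label", "prediction").show()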
Proof of implementation
- Show the first 10 samples of the nginx_logs dataframe before encoding.
- Show the first 10 samples of the nginx_logs dataframe after encoding.
- Show the size of the train and test dataframes (only the number of rows).
- Show the value of f1 and accuracy for the first model.
- Show the value of f1 and accuracy for the second model.
- Answer the question.
- Show the prediction result for the selected request. The output should have the label and the predicted label.