Course: Big Data - IU S25
Author: Firas Jolha
This assignment provides hands-on experience in using Spark to analyze a real-world dataset and extract meaningful business insights.
Web server logs contain information on any event that was registered/logged. They carry a lot of insight about website visitors and their behavior, crawlers accessing the site, business metrics, security issues, and more. This midterm uses two clusters (MongoDB and Hadoop). The Hadoop cluster will be used to transform the dataset and load it into MongoDB using PySpark. In the MongoDB cluster, you will analyze the data and store the results of the data analytics. The data in MongoDB should be stored in a sharded collection.
The dataset can be downloaded from the link attached to this document. It contains the access logs of an Nginx server, where each line stores the log info of a single request to the server. The typical log format in Nginx servers is as follows:
'$remote_addr - $remote_user [$time_local] ' '"$request" $status $body_bytes_sent ' '"$http_referer" "$http_user_agent"';
You can see below the first 15 lines of this data file. You need to explore the dataset to understand its schema.
54.36.149.41 - - [22/Jan/2019:03:56:14 +0330] "GET /filter/27|13%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,27|%DA%A9%D9%85%D8%AA%D8%B1%20%D8%A7%D8%B2%205%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,p53 HTTP/1.1" 200 30577 "-" "Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)" "-"
31.56.96.51 - - [22/Jan/2019:03:56:16 +0330] "GET /image/60844/productModel/200x200 HTTP/1.1" 200 5667 "https://www.zanbil.ir/m/filter/b113" "Mozilla/5.0 (Linux; Android 6.0; ALE-L21 Build/HuaweiALE-L21) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.158 Mobile Safari/537.36" "-"
31.56.96.51 - - [22/Jan/2019:03:56:16 +0330] "GET /image/61474/productModel/200x200 HTTP/1.1" 200 5379 "https://www.zanbil.ir/m/filter/b113" "Mozilla/5.0 (Linux; Android 6.0; ALE-L21 Build/HuaweiALE-L21) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.158 Mobile Safari/537.36" "-"
40.77.167.129 - - [22/Jan/2019:03:56:17 +0330] "GET /image/14925/productModel/100x100 HTTP/1.1" 200 1696 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
91.99.72.15 - - [22/Jan/2019:03:56:17 +0330] "GET /product/31893/62100/%D8%B3%D8%B4%D9%88%D8%A7%D8%B1-%D8%AE%D8%A7%D9%86%DA%AF%DB%8C-%D9%BE%D8%B1%D9%86%D8%B3%D9%84%DB%8C-%D9%85%D8%AF%D9%84-PR257AT HTTP/1.1" 200 41483 "-" "Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0)Gecko/16.0 Firefox/16.0" "-"
40.77.167.129 - - [22/Jan/2019:03:56:17 +0330] "GET /image/23488/productModel/150x150 HTTP/1.1" 200 2654 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
40.77.167.129 - - [22/Jan/2019:03:56:18 +0330] "GET /image/45437/productModel/150x150 HTTP/1.1" 200 3688 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
40.77.167.129 - - [22/Jan/2019:03:56:18 +0330] "GET /image/576/article/100x100 HTTP/1.1" 200 14776 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
66.249.66.194 - - [22/Jan/2019:03:56:18 +0330] "GET /filter/b41,b665,c150%7C%D8%A8%D8%AE%D8%A7%D8%B1%D9%BE%D8%B2,p56 HTTP/1.1" 200 34277 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
40.77.167.129 - - [22/Jan/2019:03:56:18 +0330] "GET /image/57710/productModel/100x100 HTTP/1.1" 200 1695 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
207.46.13.136 - - [22/Jan/2019:03:56:18 +0330] "GET /product/10214 HTTP/1.1" 200 39677 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
40.77.167.129 - - [22/Jan/2019:03:56:19 +0330] "GET /image/578/article/100x100 HTTP/1.1" 200 9831 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
178.253.33.51 - - [22/Jan/2019:03:56:19 +0330] "GET /m/product/32574/62991/%D9%85%D8%A7%D8%B4%DB%8C%D9%86-%D8%A7%D8%B5%D9%84%D8%A7%D8%AD-%D8%B5%D9%88%D8%B1%D8%AA-%D9%BE%D8%B1%D9%86%D8%B3%D9%84%DB%8C-%D9%85%D8%AF%D9%84-PR465AT HTTP/1.1" 200 20406 "https://www.zanbil.ir/m/filter/p5767%2Ct156?name=%D9%85%D8%A7%D8%B4%DB%8C%D9%86-%D8%A7%D8%B5%D9%84%D8%A7%D8%AD&productType=electric-shavers" "Mozilla/5.0 (Linux; Android 5.1; HTC Desire 728 dual sim) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.83 Mobile Safari/537.36" "-"
40.77.167.129 - - [22/Jan/2019:03:56:19 +0330] "GET /image/6229/productModel/100x100 HTTP/1.1" 200 1796 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
91.99.72.15 - - [22/Jan/2019:03:56:19 +0330] "GET /product/10075/13903/%D9%85%D8%A7%DB%8C%DA%A9%D8%B1%D9%88%D9%81%D8%B1-%D8%B1%D9%88%D9%85%DB%8C%D8%B2%DB%8C-%D8%B3%D8%A7%D9%85%D8%B3%D9%88%D9%86%DA%AF-%D9%85%D8%AF%D9%84-CE288 HTTP/1.1" 200 41725 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.92 Safari/537.36" "-"
You need to create and run your PySpark app on the university cluster. If you need access, please contact the instructor (@FirasJolha) on Telegram.
For the sections spark-rdd, spark-dataframe and spark-sql, you need to prepare a Jupyter notebook app.ipynb using the kernel Pyspark 3.11 on the cluster and write and execute all queries in this notebook.
Important notes:
For the Spark MLlib section, you need to write your PySpark application in an app.py file and submit it using spark-submit. You need to create a script file run.sh that prepares the assignment workspace and installs all required dependencies. After that, the same script run.sh will run the application using the YARN resource manager as master and client deploy mode. Redirect the stdout and stderr (&>) of the spark-submit command to a file output.txt. This file will contain the output of the app. You can set the log level to ERROR in the application if you prefer.
Submit to Moodle four files: app.ipynb, app.py, run.sh, and output.txt. Do not zip these files when you submit.
You are allowed to use any support tools, but it is not recommended to use GenAI tools for generating text or code. You can use these generative tools to proofread your report or check the syntax of your code. All solutions must be submitted individually. We will perform a plagiarism check on the code and report, and you will be penalized if your submission is found to be plagiarized.
NOTE:
Proof of implementation
A block like this one outlines the results of the task that must be presented as evidence for grading purposes. These requirements must be followed as part of the tasks.
Note: In this list of exercises, you have to use the pyspark.rdd API. You need to perform the analysis using Spark RDD API.
Upload the dataset to HDFS under /data/<your-name>. List the content of the folder /data/<your-name> in HDFS using the hdfs dfs command.
Proof of implementation
- Show the content of the folder /data/<your-name> in HDFS.
Create a Python virtual environment venv and add any packages you need. Read the log file into an RDD and extract the fields (remote_addr, time_local, request, url, status, body_bytes_sent) per request. For example, the first line in the file should be transformed into:
{'remote_addr': '54.36.149.41', # str
'time_local': { # dict
'year': 2019, # int
'month': 1, # int
'day': 22, # int
'hours': 3, # int
'minutes': 56, # int
'seconds': 14 # int
},
'request': 'GET', # str
'url': '/filter/27|13%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,27|%DA%A9%D9%85%D8%AA%D8%B1%20%D8%A7%D8%B2%205%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,p53', # str
'status': 200, # int
'body_bytes_sent': 30577 # int
}
Here I dropped some of the metadata from the request line. We assume that we need to keep only the fields mentioned in the example (remote_addr, time_local, request, url, status, body_bytes_sent).
Note: It is not allowed to change the format/schema or the data types of the fields; they must be stored exactly with the schema and data types given above.
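For illustration, a minimal RDD parsing sketch is given below. The regular expression, the helper name parse_line, the variable names (spark, sc, logs_rdd), and the HDFS path and file name are assumptions, not requirements; adapt them to your own notebook.

import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nginx-logs").getOrCreate()
sc = spark.sparkContext

# Pattern covering only the fields we keep from each Nginx access-log line
LOG_PATTERN = re.compile(
    r'(?P<remote_addr>\S+) - \S+ \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) (?P<body_bytes_sent>\d+)'
)

MONTHS = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
          'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}

def parse_line(line):
    m = LOG_PATTERN.match(line)
    if m is None:
        return None                      # skip malformed lines
    # time_local looks like "22/Jan/2019:03:56:14 +0330"
    day, mon, rest = m.group('time_local').split('/', 2)
    year, hours, minutes, seconds = rest.split(' ')[0].split(':')
    return {
        'remote_addr': m.group('remote_addr'),
        'time_local': {
            'year': int(year), 'month': MONTHS[mon], 'day': int(day),
            'hours': int(hours), 'minutes': int(minutes), 'seconds': int(seconds),
        },
        'request': m.group('request'),
        'url': m.group('url'),
        'status': int(m.group('status')),
        'body_bytes_sent': int(m.group('body_bytes_sent')),
    }

# Hypothetical HDFS path; point it at your own /data/<your-name> folder
logs_rdd = (sc.textFile("hdfs:///data/<your-name>/access.log")
              .map(parse_line)
              .filter(lambda rec: rec is not None))

print(logs_rdd.count())        # number of items in the RDD
for rec in logs_rdd.take(10):  # first 10 items
    print(rec)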
Proof of implementation
- Show the number of items in the RDD.
- Show first 10 items of the RDD.
Proof of implementation
- Show the first 10 results from the RDD sorted by most recent years.
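Assuming this exercise asks for the parsed records ordered by year, most recent first, one possible sketch reusing logs_rdd from the parsing example above is:

# Hypothetical sketch: order the parsed records by year, most recent first
recent_first = logs_rdd.sortBy(lambda rec: rec['time_local']['year'], ascending=False)
for rec in recent_first.take(10):
    print(rec)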
Proof of implementation
- Show the first 10 results from the RDD sorted by top 10 most successful GET requests…
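One possible reading of this requirement is the 10 URLs with the most GET requests that returned status 200; a hypothetical sketch, again reusing logs_rdd:

# Hypothetical sketch: most frequent successful (status 200) GET requests per URL
top_get = (logs_rdd
           .filter(lambda rec: rec['request'] == 'GET' and rec['status'] == 200)
           .map(lambda rec: (rec['url'], 1))
           .reduceByKey(lambda a, b: a + b)
           .sortBy(lambda kv: kv[1], ascending=False))
print(top_get.take(10))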
Note: In this list of exercises, you have to use the pyspark.sql.DataFrame API. You need to perform the analysis using Spark DataFrame API.
{'remote_addr': '54.36.149.41', # StringType
'time_local': { # StructType
'year': 2019, # IntegerType
'month': 1, # IntegerType
'day': 22, # IntegerType
'hours': 3, # IntegerType
'minutes': 56, # IntegerType
'seconds': 14 # IntegerType
},
'request': 'GET', # StringType
'url': '/filter/27|13%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,27|%DA%A9%D9%85%D8%AA%D8%B1%20%D8%A7%D8%B2%205%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,p53', # StringType
'status': 200, # IntegerType
'body_bytes_sent': 30577 # IntegerType
}
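A sketch of an explicit schema matching the types above is shown below, together with one possible way to build the dataframe from the parsed RDD; the names log_schema, logs_rdd, and logs_df are assumptions carried over from the earlier sketches.

from pyspark.sql import types as T

# Explicit schema matching the required field types
log_schema = T.StructType([
    T.StructField("remote_addr", T.StringType()),
    T.StructField("time_local", T.StructType([
        T.StructField("year", T.IntegerType()),
        T.StructField("month", T.IntegerType()),
        T.StructField("day", T.IntegerType()),
        T.StructField("hours", T.IntegerType()),
        T.StructField("minutes", T.IntegerType()),
        T.StructField("seconds", T.IntegerType()),
    ])),
    T.StructField("request", T.StringType()),
    T.StructField("url", T.StringType()),
    T.StructField("status", T.IntegerType()),
    T.StructField("body_bytes_sent", T.IntegerType()),
])

# One possible way to build the dataframe: reuse the RDD of dicts
# produced by the parsing sketch in the RDD section
logs_df = spark.createDataFrame(logs_rdd, schema=log_schema)

print((logs_df.count(), len(logs_df.columns)))   # size: rows and columns
logs_df.printSchema()                            # schema after reading
logs_df.show(10, truncate=False)                 # first 10 samples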
Proof of implementation
- Show the size of the dataframe (rows and columns).
- Show the schema of the dataframes after reading.
- Show first 10 samples of the dataframes.
Proof of implementation
- Show the number of records in the dataframe.
- Show first 10 samples of the dataframes.
Proof of implementation
- Show the number of records in the dataframe.
- Show first 10 samples of the dataframes.
Proof of implementation
- Show first 10 samples of the dataframe.
- Show the number of records per category in the dataframe.
Calculate the total of body_bytes_sent by each remote_addr.
Proof of implementation
- Show first 10 samples of the dataframe sorted by maximum total of bytes of log body.
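A possible DataFrame sketch for this aggregation (the output column name is an assumption):

from pyspark.sql import functions as F

# Total body_bytes_sent per remote_addr, largest totals first
bytes_per_addr = (logs_df
                  .groupBy("remote_addr")
                  .agg(F.sum("body_bytes_sent").alias("total_body_bytes_sent"))
                  .orderBy(F.desc("total_body_bytes_sent")))
bytes_per_addr.show(10, truncate=False)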
Proof of implementation
- Show first 10 values for both variables (total number of successful requests, average body size) before calculating the correlation coefficient.
- Show the number of the values for which the correlation will be calculated.
- Show the final correlation coefficient.
- Show the result of statistical significance of the correlation coefficient.
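A hypothetical sketch of this analysis, assuming the two variables are computed per remote_addr and that scipy is available on the cluster for the significance test:

from pyspark.sql import functions as F

# Per remote_addr: number of successful requests and average body size
per_addr = (logs_df
            .filter(F.col("status") == 200)
            .groupBy("remote_addr")
            .agg(F.count("*").alias("successful_requests"),
                 F.avg("body_bytes_sent").alias("avg_body_size")))

per_addr.show(10)        # first 10 values of both variables
print(per_addr.count())  # number of values used for the correlation

# Pearson correlation coefficient between the two variables
print(per_addr.corr("successful_requests", "avg_body_size"))

# Statistical significance: collect the (already aggregated) table and use scipy
import scipy.stats as st
pdf = per_addr.toPandas()
r, p_value = st.pearsonr(pdf["successful_requests"], pdf["avg_body_size"])
print(r, p_value)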
Write a UDF that extracts the first part of the url. For example, the url /filter/27|13%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,27|%DA%A9%D9%85%D8%AA%D8%B1%20%D8%A7%D8%B2%205%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,p53 has the first part as /filter. Your UDF should extract this part for all log lines.
Proof of implementation
- Show the results for the first 10 records after applying the UDF on the dataframe. The output should display the column both before and after applying the UDF.
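A minimal UDF sketch; the function name first_path_part and the splitting rule are assumptions.

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Keep only the first path segment of the url, e.g. '/filter/27|13...' -> '/filter'
@F.udf(returnType=StringType())
def first_path_part(url):
    if url is None:
        return None
    parts = url.split('/')
    return '/' + parts[1] if len(parts) > 1 else url

# Show the column before and after applying the UDF
logs_df.select("url", first_path_part("url").alias("url_first_part")).show(10, truncate=False)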
Note: In this list of exercises, you have to use the spark.sql function. You need to perform the analysis using Spark SQL queries.
Register the dataframe as a temporary view in the catalog (nginx_logs).
Proof of implementation
- Show all tables in the catalog.
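A minimal sketch of registering the view and listing the catalog, assuming logs_df from the DataFrame section:

# Register the dataframe as a temporary view named nginx_logs
logs_df.createOrReplaceTempView("nginx_logs")

# Show all tables/views known to the catalog
spark.sql("SHOW TABLES").show(truncate=False)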
Query the nginx_logs view.
Proof of implementation
- Show the first 10 records of this view.
Write a query that finds the GET requests for resources under the /product/ path.
Proof of implementation
- Show the first 10 results of this query.
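A hypothetical Spark SQL sketch for this query; the selected columns are an assumption.

spark.sql("""
    SELECT remote_addr, url, status, body_bytes_sent
    FROM nginx_logs
    WHERE request = 'GET' AND url LIKE '/product/%'
""").show(10, truncate=False)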
Proof of implementation
- Show the top 10 results of this query.
Write a query that finds the resources (request path) that resulted in a 404 error.
Proof of implementation
- Show the top 10 results of this query.
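A hypothetical Spark SQL sketch, assuming the exercise asks for the most frequent request paths with a 404 status:

spark.sql("""
    SELECT url, COUNT(*) AS cnt
    FROM nginx_logs
    WHERE status = 404
    GROUP BY url
    ORDER BY cnt DESC
""").show(10, truncate=False)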
Note: In this list of exercises, you have to use the pyspark.ml and/or pyspark.mllib modules.
1. Use transformers (pyspark.ml.Transformer) to encode all the following features in the dataframe:
- remote address (categorical feature)
- request method (categorical feature)
- request uri (text feature)
- status (categorical feature)
- body_bytes_sent (numerical feature)
For categorical features you can use one-hot encoding or any other appropriate encoding method. For text features, you can use word2vec or any other appropriate method, but not one-hot encoding. For numerical features, you need to scale them using an appropriate scaler. Add any other transformers needed here. Fit and transform the dataframe. (A sketch covering steps 1-7 is shown after this list.)
2. Create a new column label to indicate whether the request had a 200 status code. Encode this column using an appropriate transformer.
3. Build two different classifiers to classify the successfulness of requests. The classifiers should be based on different classification methods, such as Decision Tree and Support Vector Machine.
4. Split the dataset and use the training set for training and test set for reporting the performance of the models.
5. Train the models.
6. Evaluate the performance of both classifiers using accuracy and f1 metrics. Use the test data. Which model is better in terms of f1?
7. Using the best model, predict the success of a specific request from the test set.
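A minimal end-to-end sketch of steps 1-7 is given below. The choice of encoders (StringIndexer + OneHotEncoder, RegexTokenizer + Word2Vec, MinMaxScaler), the 80/20 split, the Decision Tree / Linear SVC pair, and all column and variable names are illustrative assumptions; adapt them to your own solution.

from pyspark.sql import functions as F
from pyspark.ml import Pipeline
from pyspark.ml.feature import (StringIndexer, OneHotEncoder, RegexTokenizer,
                                Word2Vec, VectorAssembler, MinMaxScaler)
from pyspark.ml.classification import DecisionTreeClassifier, LinearSVC
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# 2. Label: 1.0 if the request returned status 200, else 0.0
data = logs_df.withColumn("label", F.when(F.col("status") == 200, 1.0).otherwise(0.0))

# 1. Categorical features: index then one-hot encode
addr_idx = StringIndexer(inputCol="remote_addr", outputCol="remote_addr_idx", handleInvalid="keep")
addr_ohe = OneHotEncoder(inputCol="remote_addr_idx", outputCol="remote_addr_vec")
req_idx = StringIndexer(inputCol="request", outputCol="request_idx", handleInvalid="keep")
req_ohe = OneHotEncoder(inputCol="request_idx", outputCol="request_vec")
# note: status is listed among the features in the task even though the label is derived from it
status_idx = StringIndexer(inputCol="status", outputCol="status_idx", handleInvalid="keep")
status_ohe = OneHotEncoder(inputCol="status_idx", outputCol="status_vec")

# Text feature: tokenize the url and embed it with Word2Vec
url_tok = RegexTokenizer(inputCol="url", outputCol="url_tokens", pattern="[/,%|.-]+")
url_w2v = Word2Vec(inputCol="url_tokens", outputCol="url_vec", vectorSize=32, minCount=1)

# Numerical feature: scale body_bytes_sent
bytes_vec = VectorAssembler(inputCols=["body_bytes_sent"], outputCol="body_bytes_vec")
bytes_scaled = MinMaxScaler(inputCol="body_bytes_vec", outputCol="body_bytes_scaled")

features = VectorAssembler(
    inputCols=["remote_addr_vec", "request_vec", "status_vec", "url_vec", "body_bytes_scaled"],
    outputCol="features")

prep = Pipeline(stages=[addr_idx, addr_ohe, req_idx, req_ohe, status_idx, status_ohe,
                        url_tok, url_w2v, bytes_vec, bytes_scaled, features])
encoded = prep.fit(data).transform(data)

# 4. Split into train and test sets
train, test = encoded.randomSplit([0.8, 0.2], seed=42)
print(train.count(), test.count())

# 5. Train two classifiers based on different methods
dt = DecisionTreeClassifier(featuresCol="features", labelCol="label").fit(train)
svc = LinearSVC(featuresCol="features", labelCol="label").fit(train)

# 6. Evaluate accuracy and f1 on the test set
acc_eval = MulticlassClassificationEvaluator(labelCol="label", metricName="accuracy")
f1_eval = MulticlassClassificationEvaluator(labelCol="label", metricName="f1")
for name, model in [("DecisionTree", dt), ("LinearSVC", svc)]:
    preds = model.transform(test)
    print(name, "accuracy =", acc_eval.evaluate(preds), "f1 =", f1_eval.evaluate(preds))

# 7. Predict the success of one specific request from the test set
dt.transform(test.limit(1)).select("label", "prediction").show()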
Proof of implementation
- Show the first 10 samples of the nginx_logs dataframe before encoding.
- Show the first 10 samples of the nginx_logs dataframe after encoding.
- Show the size of the train and test dataframes (only the number of rows).
- Show the value of f1 and accuracy for the first model.
- Show the value of f1 and accuracy for the second model.
- Answer the question.
- Show the prediction result for the selected request. The output should have the label and the predicted label.