Assignment 3 - Apache Spark SQL & MLlib

Course: Big Data - IU S25
Author: Firas Jolha

Dataset

Assignment 3 Colab notebook

Moodle submission

Prerequisites

Instructions

In this assignment, you will work in Google Colab and must use the notebook template shared above. The template shows how to install PySpark and how to publish the Spark UI using localtunnel. Submit only the notebook, as an .ipynb file, to Moodle.

Important notes:

  1. Every cell of the notebook must be executed and must show its output. Cells that have no output, were never executed, or were edited but not executed again will not be evaluated.
  2. Each cell must have an execution count and, if it produces any output, that output must be visible.
  3. The assignment is individual. You are NOT allowed to copy from your colleagues or from AI-based support tools such as ChatGPT. Plagiarism detection will be strictly applied in this assignment. Any plagiarized submission will get zero points and appropriate action will be taken.

Note: In this assignment, you do not work on the whole dataset files. You must take a 10% sample of the products dataframe and use it for the analysis, with your SID as the seed of the pyspark.sql.DataFrame.sample function. The shared notebook contains the following code snippets to take the sample:

# df_products is the dataframe after successfully reading it from the file products.jsonl using the schema
# Rename some columns
df_products = df_products.withColumnsRenamed(
    {
      "title": "product_title",
      "images": "product_images"
    }
)


# Take a 10% sample without replacement; your SID is the seed
df_products_sample = df_products.sample(withReplacement=False, fraction=0.1, seed=SID)

# Now you can delete the original dataframe
del df_products

# df_reviews is the dataframe after successfully reading it from the file reviews.jsonl using the schema
# Rename some columns
df_reviews = df_reviews.withColumnsRenamed(
    {
      "title": "review_title",
      "text": "review_text",
      "timestamp": "review_timestamp",
      "images": "reviewed_product_images"
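    }
)

# Remember the review columns so that the join below does not add product columns
cols = df_reviews.columns

# Keep only the reviews of the products in the sample
df_reviews_sample = df_reviews.join(df_products_sample, on='parent_asin', how='inner').select(cols)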

# Now you can delete the original dataframe
del df_reviews

# Now you must use only these samples for all analysis tasks.
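
Two properties of the sampling step are easy to miss: a fraction of 0.1 does not mean exactly 10% of the rows (the PySpark documentation notes that `sample` is not guaranteed to return exactly the requested fraction), and the seed is what makes your sample reproducible. The sketch below is a plain-Python analogy, not Spark's implementation. It assumes each row is kept independently with probability `fraction`; the function name `bernoulli_sample` and the SID value `123456` are made up for illustration.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    # A private generator, so the result depends only on `seed`
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


SID = 123456  # hypothetical student ID, used only for this illustration
rows = list(range(10_000))

sample_a = bernoulli_sample(rows, fraction=0.1, seed=SID)
sample_b = bernoulli_sample(rows, fraction=0.1, seed=SID)
sample_c = bernoulli_sample(rows, fraction=0.1, seed=SID + 1)

# Same seed gives the same sample; a different seed gives a different one
print(sample_a == sample_b, sample_a == sample_c)

# Sizes are close to 1,000 but need not be exactly 1,000
print(len(sample_a), len(sample_c))
```

The practical consequence for the notebook is that the row count of df_products_sample will be near, but usually not equal to, 10% of the products, and that it will differ between students because each SID is a different seed.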

Assignment Description

This assignment is dedicated to practising Apache Spark SQL and Spark's machine learning libraries. You can use pyspark.sql, the legacy RDD-based MLlib in pyspark.mllib, and the DataFrame-based MLlib in pyspark.ml for Python. You will work on data analysis tasks using Spark SQL and on predictive analysis using different ML tasks on the same dataset. You also need to train a recommendation system that suggests products to Amazon users. All of this is done on a sample of the Amazon products dataset.

The full dataset covers n=137,269 products rated by Amazon users and r=4,624,615 reviews. The datasets are described below.

Dataset Description

This dataset consists of two JSONL files: reviews.jsonl contains users’ reviews of products on Amazon, and products.jsonl contains the product metadata. The description of the fields in reviews.jsonl is as follows:

Below is a sample of 10 records from the file reviews.jsonl:

{"rating": 4.0, "title": "It\u2019s pretty sexual. Not my fav", "text": "I\u2019m playing on ps5 and it\u2019s interesting.  It\u2019s unique, massive, and has a neat story.  People are freaking out angry about this game.  I don\u2019t think it\u2019s a top 10 game but it\u2019s definitely a good game on ps5 (played at launch).", "images": [], "asin": "B07DJWBYKP", "parent_asin": "B07DK1H3H5", "user_id": "AGCI7FAH4GL5FI65HYLKWTMFZ2CQ", "timestamp": 1608186804795, "helpful_vote": 0, "verified_purchase": true}
{"rating": 5.0, "title": "Good. A bit slow", "text": "Nostalgic fun.  A bit slow.  I hope they don\u2019t stretch it out too far.  It\u2019s good tho", "images": [], "asin": "B00ZS80PC2", "parent_asin": "B07SRWRH5D", "user_id": "AGCI7FAH4GL5FI65HYLKWTMFZ2CQ", "timestamp": 1587051114941, "helpful_vote": 1, "verified_purchase": false}
{"rating": 5.0, "title": "... an order for my kids & they have really enjoyed playing this PC game", "text": "This was an order for my kids & they have really enjoyed playing this PC game.", "images": [], "asin": "B01FEHJYUU", "parent_asin": "B07MFMFW34", "user_id": "AGXVBIUFLFGMVLATYXHJYL4A5Q7Q", "timestamp": 1490877431000, "helpful_vote": 0, "verified_purchase": true}
{"rating": 5.0, "title": "Great alt to pro controller", "text": "These work great, They use batteries which is a bummer, but for the 40 less that i paid its worth it. Batteries last a long time. Have been using to play rocket league on the switch with no issues", "images": [], "asin": "B07GXJHRVK", "parent_asin": "B0BCHWZX95", "user_id": "AFTC6ZR5IKNRDG5JCPVNVMU3XV2Q", "timestamp": 1577637634017, "helpful_vote": 0, "verified_purchase": true}
{"rating": 5.0, "title": "solid product", "text": "I would recommend to anyone looking to add just a little bit of height and a lot of grip to their thumb sticks. These will not create miracles, but it will give you better leverage for shooters.", "images": [], "asin": "B00HUWA45W", "parent_asin": "B00HUWA45W", "user_id": "AFTC6ZR5IKNRDG5JCPVNVMU3XV2Q", "timestamp": 1427591932000, "helpful_vote": 0, "verified_purchase": true}
{"rating": 3.0, "title": "love all the amazing colors but the black is really ...", "text": "love all the amazing colors but the black is really hard to see and I always have to have another form of a light on to use the key board now", "images": [], "asin": "B016Y2BVKA", "parent_asin": "B073SC6V1D", "user_id": "AHXSBZT52TCPZUBVCBRICTHWUCBA", "timestamp": 1518124539574, "helpful_vote": 0, "verified_purchase": true}
{"rating": 5.0, "title": "Will use again", "text": "Instant delivery!", "images": [], "asin": "B004RMK57U", "parent_asin": "B004RMK57U", "user_id": "AHZIJGKEWRTAEOZ673G5B3SNXEGQ", "timestamp": 1602937709361, "helpful_vote": 0, "verified_purchase": true}
{"rating": 5.0, "title": "if its prime day and you're contemplating, just stop -- do it.", "text": "you should probably get these. you'll journal about the next-level experience, or at least sit there and revel for a couple min. honestly, it's like you've never heard footsteps before. or windows systems settings. they recommend against using 7.1 surround sound on microsoft teams and skype -- but nobody said SMACK about zoom. these professors be twich streamin asmr evry time they take a drink of their 'make me talky' water. sometimes it blocks out my boyfriends nonsense caveman grumblins while hes playing rocket league and im on the sims, or i guess you can turn up the ambience and pretend you cant hear someone unironically ironically screaming POGGERS every 3 min. great deal on prime day btw. now, razer, or amazon, whoever comes first, just keep the discounts for the next year please so I can sell my last kidney to afford the rest of my pc build. thanks. cheers xx<br /><br />ps. half-joking about the selling my organs thing and poggers thing. kind of. hook it up for ur fav broke AF college graphic designer, would you pleaseee. 
:) &lt;3", "images": [{"small_image_url": "https://images-na.ssl-images-amazon.com/images/I/41bzyynuwTL._SL256_.jpg", "medium_image_url": "https://images-na.ssl-images-amazon.com/images/I/41bzyynuwTL._SL800_.jpg", "large_image_url": "https://images-na.ssl-images-amazon.com/images/I/41bzyynuwTL._SL1600_.jpg", "attachment_type": "IMAGE"}, {"small_image_url": "https://images-na.ssl-images-amazon.com/images/I/7104ErJuizL._SL256_.jpg", "medium_image_url": "https://images-na.ssl-images-amazon.com/images/I/7104ErJuizL._SL800_.jpg", "large_image_url": "https://images-na.ssl-images-amazon.com/images/I/7104ErJuizL._SL1600_.jpg", "attachment_type": "IMAGE"}, {"small_image_url": "https://images-na.ssl-images-amazon.com/images/I/71IRNwL2DkL._SL256_.jpg", "medium_image_url": "https://images-na.ssl-images-amazon.com/images/I/71IRNwL2DkL._SL800_.jpg", "large_image_url": "https://images-na.ssl-images-amazon.com/images/I/71IRNwL2DkL._SL1600_.jpg", "attachment_type": "IMAGE"}], "asin": "B07N85FY1G", "parent_asin": "B0BYVN9ZK2", "user_id": "AFO6QN6ICKWUFQV3UEWK5EECIQTQ", "timestamp": 1602718512453, "helpful_vote": 0, "verified_purchase": false}
{"rating": 5.0, "title": "Price bumps it up from 4 stars", "text": "*it fits TWO wired Retro-bit 6 button controllers. Yes, the ones with the 8 ft cables.<br />Plus, a cut-out area for the console, and the factory hdmi and power supply fit it the lid pouch.<br /><br />warning: the stock 3 button controllers won't fit<br /><br />great simple case with good stamp-cut foam padding", "images": [], "asin": "B08L6782X9", "parent_asin": "B08L6782X9", "user_id": "AG6BAEKWLCWH2TW3KKLVK773YF6A", "timestamp": 1621448670253, "helpful_vote": 0, "verified_purchase": true}
{"rating": 1.0, "title": "It's an Auto-renew scam", "text": "Sony and Amazon are collaborating in an Auto-renew scam<br /><br />Buying this turns on Auto-renew allowing Sony to charge double the annual fee<br /><br />Sony is exploiting the financially challenged, Amazon gets kickbacks, and people who deserved to be burned alive are unscathed.", "images": [], "asin": "B017V6YVDC", "parent_asin": "B017V6YVDC", "user_id": "AG6BAEKWLCWH2TW3KKLVK773YF6A", "timestamp": 1607734474794, "helpful_vote": 2, "verified_purchase": true}

The description of the fields in products.jsonl is as follows:

Below is a sample of 10 records from the file products.jsonl:

{"main_category": "Video Games", "title": "Dash 8-300 Professional Add-On", "average_rating": 5.0, "rating_number": 1, "features": ["Features Dash 8-300 and 8-Q300 ('Q' rollout livery)", "Airlines - US Airways, South African Express, Bahamasair, Augsburg Airways, Lufthansa Cityline, British Airways (Union Jack), British European, FlyBe, Intersky, Wideroe, Iberia, Tyrolean, QantasLink, BWIA", "Airports include - London City, Frankfurt, Milan and Amsterdam Schipol", "Includes PSS PanelConfig and LoadEdit tools"], "description": ["The Dash 8-300 Professional Add-On lets you pilot a real commuter special. Fly two versions of the popular Dash 8-300 in a total of 17 different liveries. The Dash 8-300 is one of the most popular short-haul aircraft available and this superbly modelled version from acclaimed aircraft developers PSS is modelled in two versions with a total of 17 different liveries. The package also includes scenery for three European airports, tutorials, tutorial flights and utilities together in one fantastic package."], "price": null, "images": [{"thumb": "https://m.media-amazon.com/images/I/21DVWE41A0L._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/21DVWE41A0L.jpg", "variant": "MAIN", "hi_res": null}], "videos": [], "store": "Aerosoft", "categories": ["Video Games", "PC", "Games"], "details": {"Pricing": "The strikethrough price is the List Price. Savings represents a discount off the List Price.", "Package Dimensions": "7.5 x 5.5 x 0.6 inches; 4.8 Ounces", "Type of item": "CD-ROM", "Rated": "Everyone", "Item Weight": "4.8 ounces", "Manufacturer": "Aerosoft N.A. LTD", "Date First Available": "October 2, 2001"}, "parent_asin": "B000FH0MHO", "bought_together": null}
{"main_category": "Video Games", "title": "Phantasmagoria: A Puzzle of Flesh", "average_rating": 4.1, "rating_number": 18, "features": ["Windows 95"], "description": [], "price": null, "images": [{"thumb": "https://m.media-amazon.com/images/I/51pqAznTA9L._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/51pqAznTA9L.jpg", "variant": "MAIN", "hi_res": "https://m.media-amazon.com/images/I/71hD-k6kaxL._SL1101_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/61CCFhIg4qL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/61CCFhIg4qL.jpg", "variant": "PT01", "hi_res": "https://m.media-amazon.com/images/I/81dGuRrFwAL._SL1104_.jpg"}], "videos": [], "store": "Sierra", "categories": ["Video Games", "PC", "Games"], "details": {"Best Sellers Rank": {"Video Games": 137612, "PC-compatible Games": 6707}, "Pricing": "The strikethrough price is the List Price. Savings represents a discount off the List Price.", "Package Dimensions": "5.6 x 4.9 x 0.9 inches; 6.4 Ounces", "Type of item": "CD-ROM", "Rated": "Mature", "Is Discontinued By Manufacturer": "No", "Item Weight": "6.4 ounces", "Manufacturer": "Sierra", "Date First Available": "March 30, 2006"}, "parent_asin": "B00069EVOG", "bought_together": null}
{"main_category": "Video Games", "title": "NBA 2K17 - Early Tip Off Edition - PlayStation 4", "average_rating": 4.3, "rating_number": 223, "features": ["The #1 rated NBA video game simulation series for the last 15 years (Metacritic).", "The #1 selling NBA video game simulation series for the last 9 years (NPD).", "Over 85 awards and nominations since the launch of PlayStation 4 & Xbox One.", "BEST IN CLASS GAMEPLAY - 2K puts shot making in your hands like never before. Advanced Skill Shooting gives you complete control over the power and aim of your perimeter shots as well as your ability to finish inside the paint.", "THE PRELUDE - Begin your MyCAREER on one of 10 licensed collegiate programs, available for free download one week prior to launch!", "MyCAREER - It\u2019s all-new and all about basketball in 2K17 \u2013 and you\u2019re in control. Your on-court performance and career decisions lead to different outcomes as you determine your path through an immersive new narrative, featuring Michael B. Jordan. Additionally, new player controls give you unparalleled supremacy on the court.", "USA BASKETBALL - Take the court as Team USA with Coach K on the sidelines, or relive the glory of the \u201992 Dream Team. Earn USAB MyTEAM cards and gear up your MyPLAYER with official USAB wearables.", "COLLEGE INTEGRATION - For the first time, play as college basketball legends with each school\u2019s all-time greats team and MyTEAM cards.", "LEAGUE EXPANSION - For the first time, customize your MyLEAGUE and MyGM experience with league expansion. Choose your expansion team names, logos and uniforms, and share them with the rest of the NBA 2K community. Your customized league comes complete with everything from Expansion Drafts to modified schedules and more to ensure an authentic NBA experience.", "2K BEATS Imagine Dragons, Grimes, Noah \u201c40\u201d Shabib of OVO Sound and Michael B. 
Jordan curate another electric 2K soundtrack, featuring 50 songs."], "description": ["Following the record-breaking launch of NBA 2K16, the NBA 2K franchise continues to stake its claim as the most authentic sports video game with NBA 2K17. As the franchise that \u201call sports video games should aspire to be\u201d (GamesRadar), NBA 2K17 will take the game to new heights and continue to blur the lines between video game and reality."], "price": 58.0, "images": [{"thumb": "https://m.media-amazon.com/images/I/51wlIPcf0gL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/51wlIPcf0gL.jpg", "variant": "MAIN", "hi_res": "https://m.media-amazon.com/images/I/81MtBG4xXhL._SL1500_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/51smI92XGdL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/51smI92XGdL.jpg", "variant": "PT02", "hi_res": "https://m.media-amazon.com/images/I/81fby40wGQL._SL1500_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/41B2Li+r-6L._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/41B2Li+r-6L.jpg", "variant": "PT05", "hi_res": "https://m.media-amazon.com/images/I/71VFYALs8qL._SL1500_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/518ADy9h+wL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/518ADy9h+wL.jpg", "variant": "PT06", "hi_res": "https://m.media-amazon.com/images/I/71ZbTa0QT4L._SL1500_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/41EsOEFBg0L._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/41EsOEFBg0L.jpg", "variant": "PT07", "hi_res": "https://m.media-amazon.com/images/I/71kIQMjnwWL._SL1500_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/41JUeBKY2EL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/41JUeBKY2EL.jpg", "variant": "PT08", "hi_res": "https://m.media-amazon.com/images/I/71RyXZvZA2L._SL1500_.jpg"}, {"thumb": 
"https://m.media-amazon.com/images/I/41aPDXtqZxL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/41aPDXtqZxL.jpg", "variant": "PT09", "hi_res": "https://m.media-amazon.com/images/I/71PdzeowO3L._SL1500_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/411-B81va7L._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/411-B81va7L.jpg", "variant": "PT10", "hi_res": "https://m.media-amazon.com/images/I/81Anx48IQUL._SL1500_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/41g1KDskmML._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/41g1KDskmML.jpg", "variant": "PT11", "hi_res": "https://m.media-amazon.com/images/I/71mrJKrLwJL._SL1500_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/411DcBq41HL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/411DcBq41HL.jpg", "variant": "GLMR", "hi_res": "https://m.media-amazon.com/images/I/61CSnSRFIJL._SL1500_.jpg"}], "videos": [{"title": "NBA 2K17 - Kobe: Haters vs Players", "url": "https://www.amazon.com/vdp/386e44f88d0f41d99714076c93459753?ref=dp_vse_rvc_0", "user_id": ""}], "store": "2K", "categories": ["Video Games", "PlayStation 4", "Games"], "details": {"Release date": "September 16, 2016", "Best Sellers Rank": {"Video Games": 57637, "PlayStation 4 Games": 2886}, "Pricing": "The strikethrough price is the List Price. Savings represents a discount off the List Price.", "Product Dimensions": "0.4 x 5.3 x 6.6 inches; 1.6 Ounces", "Type of item": "Video Game", "Rated": "Everyone", "Item model number": "47793", "Is Discontinued By Manufacturer": "No", "Item Weight": "1.6 ounces", "Manufacturer": "2K Games", "Date First Available": "April 13, 2016"}, "parent_asin": "B00Z9TLVK0", "bought_together": null}
{"main_category": "Video Games", "title": "Nintendo Selects: The Legend of Zelda Ocarina of Time 3D (Renewed)", "average_rating": 4.9, "rating_number": 22, "features": ["Authentic Nintendo Selects: The Legend of Zelda Ocarina of Time 3D", "Does not come with original case or manuals. Cartridge only", "Cartridge and label are in nice condition", "Fully tested and guaranteed"], "description": [], "price": 37.42, "images": [{"thumb": "https://m.media-amazon.com/images/I/51raO0wAe8L._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/51raO0wAe8L.jpg", "variant": "MAIN", "hi_res": "https://m.media-amazon.com/images/I/81dM82yx6wL._SL1500_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/51ag4Lai25L._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/51ag4Lai25L.jpg", "variant": "PT01", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/41D7zacd5cL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/41D7zacd5cL.jpg", "variant": "PT02", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/41sNSMvZGAL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/41sNSMvZGAL.jpg", "variant": "PT03", "hi_res": null}], "videos": [], "store": "Amazon Renewed", "categories": ["Video Games", "Legacy Systems", "Nintendo Systems", "Nintendo 3DS & 2DS", "Games"], "details": {"Best Sellers Rank": {"Video Games": 51019, "Nintendo 3DS & 2DS Games": 432}, "Pricing": "The strikethrough price is the List Price. Savings represents a discount off the List Price.", "Product Dimensions": "0.5 x 5.4 x 4.9 inches; 2.05 Ounces", "Type of item": "Video Game", "Rated": "Everyone 10+", "Is Discontinued By Manufacturer": "No", "Item Weight": "2.04 ounces", "Manufacturer": "Nintendo", "Date First Available": "June 14, 2019"}, "parent_asin": "B07SZJZV88", "bought_together": null}
{"main_category": "Video Games", "title": "Thrustmaster Elite Fitness Pack for Nintendo Wii", "average_rating": 3.0, "rating_number": 3, "features": ["Includes (9) Total Accessories", "Pedometer", "Wii Fit Balance Board Stepper", "Floor Mat made from high density Woven Foam", "(2) flexible Arm/Leg Weights"], "description": ["The Thrustmaster Motion Plus Elite Fitness Pack for Wii is Ideal for Nintendo Wii Fit & Wii Fit Plus games such as EA Active (EA), U Shape & My Fitness Coach (UbiSoft). Ultimate pack with 9 accessories for the Nintendo Wii Fit Balance Board including: (1) floor mat made from woven foam Size (inches) : 70x20, (2) flexible ankle or wrist weights, (1) stepper for the Wii Balance Board, (1) pedometer to count your steps during each in-game exercise or train without games, (1) armband for Wiimote or Wiimote with MotionPlus - for use in menus without stepping away from the training area, (1) leg band for the Nunchuck controller (1) lanyard to your MP3 player around your neck and (1) carry bag featuring trendy sport design with two internal pockets: ideal for storing, protecting and transporting your Wii Balance Board with one game and ALL Elite Fitness pack accessories! 
2 Year Warranty!"], "price": null, "images": [{"thumb": "https://m.media-amazon.com/images/I/31spO9JKluL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/31spO9JKluL.jpg", "variant": "MAIN", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/31rX51CAv5L._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/31rX51CAv5L.jpg", "variant": "PT01", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/31rNKdcmUQL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/31rNKdcmUQL.jpg", "variant": "PT02", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/31xkaqpKS+L._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/31xkaqpKS+L.jpg", "variant": "PT03", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/314yku+VEIL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/314yku+VEIL.jpg", "variant": "PT04", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/41ezKm50QUL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/41ezKm50QUL.jpg", "variant": "PT05", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/31u0pX5P+BL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/31u0pX5P+BL.jpg", "variant": "PT06", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/31byOVYmfUL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/31byOVYmfUL.jpg", "variant": "PT07", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/3160QI-NIXL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/3160QI-NIXL.jpg", "variant": "PT08", "hi_res": null}], "videos": [], "store": "THRUSTMASTER", "categories": ["Video Games", "Legacy Systems", "Nintendo Systems", "Wii", "Accessories", "Fitness Accessories"], "details": {"Release date": "November 1, 2009", "Pricing": "The strikethrough price is 
the List Price. Savings represents a discount off the List Price.", "Product Dimensions": "19.8 x 14 x 4.75 inches; 7.35 Pounds", "Type of item": "Video Game", "Language": "English, French", "Item model number": "4660368", "Item Weight": "7.35 pounds", "Manufacturer": "Thrustmaster VG", "Date First Available": "September 11, 2009"}, "parent_asin": "B002WH4ZJG", "bought_together": null}
{"main_category": "Video Games", "title": "Grand Prix 4", "average_rating": 3.7, "rating_number": 18, "features": ["Operating System: Windows 98, ME, XP", "Developer: Geoff Crammond", "Publisher: Infogrames", "Genre: Formula 1"], "description": ["Grand Prix 4 is the fourth installment in the GP-Series by Geoff Crammond. However, this time around, Microprose had a much larger role in the development of the game. For the first time in the Grand-Prix-Series, Grand Prix 4 had to go directly head-to-head with an EA-Sports offering, in the form of F1 2002. The flame wars and heated debate about which is better has not stopped since. Grand Prix 4 is not a revolution over Grand Prix 3, but more an evolution, as was Grand Prix 3 over Grand Prix 2. Today the game still lives on with plenty of new downloads and updates available to bring the game up to the latest season."], "price": null, "images": [{"thumb": "https://m.media-amazon.com/images/I/51167J4AQSL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/51167J4AQSL.jpg", "variant": "MAIN", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/5128X6W23SL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/5128X6W23SL.jpg", "variant": "PT01", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/51XE41TZXDL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/51XE41TZXDL.jpg", "variant": "PT03", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/51PE3CA5B7L._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/51PE3CA5B7L.jpg", "variant": "PT04", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/31BDm2VqflL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/31BDm2VqflL.jpg", "variant": "GLMR", "hi_res": "https://m.media-amazon.com/images/I/51WzbguWjjL._SL1500_.jpg"}], "videos": [], "store": null, "categories": ["Video Games", "PC", "Games"], "details": 
{"Pricing": "The strikethrough price is the List Price. Savings represents a discount off the List Price.", "Package Dimensions": "7.64 x 5.43 x 1.42 inches; 7.04 Ounces", "Type of item": "Video Game", "Is Discontinued By Manufacturer": "No", "Item Weight": "7 ounces", "Date First Available": "November 28, 2011"}, "parent_asin": "B00005Y3OJ", "bought_together": null}
{"main_category": "Video Games", "title": "Spongebob Squarepants, Vol. 1", "average_rating": 4.4, "rating_number": 32, "features": ["Bubblestand: SpongeBob shows Patrick and Squidward his unique talent for blowing bubbles. Squidward attempts to surpass SpongeBobs expertise but not everything goes according to plan!", "Ripped Pants: When SpongeBob tries to impress Sandy Cheeks at Mussel Beach, he accidentally rips his pants. The beach crowd loves the unintentional joke until SpongeBob pushes it too far.", "Jellyfishing: While Squidward is recovering from an accident, SpongeBob and Patrick take him jellyfishing. But the two unwittingly thrust poor Squidward into the hazardous rigors of their favorite pastime.", "Plankton: When evil Plankton takes control of his brain, SpongeBob must fight his own body to prevent himself from revealing the top secret Krabby Patty recipe!"], "description": ["Now you can watch the wild underwater antics of SpongeBob SquarePants on your Game Boy Advance, with this collection of 4 great episodes. In \"Hall Monitor,\" an overzealous SpongeBob becomes the new Hall Monitor at Mrs. Puff's Boating School, and extends his jurisdiction to the unsuspecting citizens of Bikini Bottom. In \"Jellyfish Jam,\" SpongeBob takes a wild jellyfish home and discovers that they multiply quickly when they take over his house! In \"Jellyfishing,\" SpongeBob and Patrick unwittingly thrust poor Squidward, who is recovering from an accident, into the hazardous rigors of their favorite pastime, jellyfishing. 
In \"Plankton,\" evil Plankton takes control of SpongeBob's brain, and SpongeBob must fight his own body to prevent himself from revealing the top secret Krabby Patty recipe!"], "price": 33.98, "images": [{"thumb": "https://m.media-amazon.com/images/I/611XSBQ4RBL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/611XSBQ4RBL.jpg", "variant": "MAIN", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/31BDm2VqflL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/31BDm2VqflL.jpg", "variant": "GLMR", "hi_res": "https://m.media-amazon.com/images/I/51WzbguWjjL._SL1500_.jpg"}], "videos": [], "store": "Majesco", "categories": ["Video Games", "Legacy Systems", "Nintendo Systems", "Game Boy Systems", "Game Boy Advance", "Games"], "details": {"Release date": "August 15, 2004", "Best Sellers Rank": {"Video Games": 47190, "Game Boy Advance Games": 142}, "Pricing": "The strikethrough price is the List Price. Savings represents a discount off the List Price.", "Product Dimensions": "5.75 x 4 x 1 inches; 0.32 Ounces", "Type of item": "Video Game", "Rated": "Everyone", "Item model number": "GAMJC 096427013426", "Is Discontinued By Manufacturer": "Yes", "Item Weight": "0.32 ounces", "Manufacturer": "Majesco Sales Inc.", "Date First Available": "June 1, 2004"}, "parent_asin": "B0001ZNU56", "bought_together": null}
{"main_category": "Video Games", "title": "eXtremeRate Soft Touch Top Shell Front Housing Faceplate Replacement Parts with Side Rails Panel for Xbox One X & One S Controller (Shadow Purple)", "average_rating": 4.5, "rating_number": 3061, "features": ["Compatibility Models: Ultra fits for Xbox One X & One S controller ; Not compatible with Xbox One Elite controller & Standard Xbox One Controller.Check the second picture of the listing before purchase", "Fit Perfectly: Fit the best by far; Completely fits flush on all side; Sit properly on all the clips", "Package Includes: 1 * Faceplate shell ; 1 * Right side rails; 1 * Left side rails;1*Open Shell Tool;1 * T8H screwdriver; 7* screws. (IMPORTANT: The controller and other parts are not included.)", "Installation Skills Required: Required customers to take apart the controller to install this front housing shell; Required customers handy with controller modifications", "Personalized Feature: The shadow purple color looks great; Great smooth grip, soft in hand and feels silky; Anti slip, sweat free for a long period game playing"], "description": [], "price": 17.59, "images": [{"thumb": "https://m.media-amazon.com/images/I/41reviTraVL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/41reviTraVL.jpg", "variant": "MAIN", "hi_res": "https://m.media-amazon.com/images/I/711FCClp7qL._SL1500_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/51Y2sNaPYuL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/51Y2sNaPYuL.jpg", "variant": "PT01", "hi_res": "https://m.media-amazon.com/images/I/61WZB0JSbfL._SL1000_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/41yHhIt5rDL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/41yHhIt5rDL.jpg", "variant": "PT02", "hi_res": "https://m.media-amazon.com/images/I/517Xy-auRHL._SL1000_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/41SlaNfN4fL._SX38_SY50_CR,0,0,38,50_.jpg", "large": 
"https://m.media-amazon.com/images/I/41SlaNfN4fL.jpg", "variant": "PT03", "hi_res": "https://m.media-amazon.com/images/I/61oqzfBYbVL._SL1000_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/41+ZJYYRKVL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/41+ZJYYRKVL.jpg", "variant": "PT04", "hi_res": "https://m.media-amazon.com/images/I/61gRCbvxyBL._SL1000_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/41DKBJyb-EL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/41DKBJyb-EL.jpg", "variant": "PT05", "hi_res": "https://m.media-amazon.com/images/I/61oHubD2jhL._SL1000_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/418wNkXtQ2L._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/418wNkXtQ2L.jpg", "variant": "PT06", "hi_res": "https://m.media-amazon.com/images/I/61fUe2N9f8L._SL1000_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/411+i-oIjIL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/411+i-oIjIL.jpg", "variant": "PT07", "hi_res": "https://m.media-amazon.com/images/I/516v2K0Aq9L._SL1000_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/41dEWqxkoWL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/41dEWqxkoWL.jpg", "variant": "PT08", "hi_res": "https://m.media-amazon.com/images/I/51F+pvO+NfL._SL1000_.jpg"}], "videos": [], "store": "eXtremeRate", "categories": ["Video Games", "Xbox One", "Accessories", "Faceplates, Protectors & Skins"], "details": {"Best Sellers Rank": {"Video Games": 48130, "Xbox One Faceplates, Protectors & Skins": 253}, "Pricing": "The strikethrough price is the List Price. 
Savings represents a discount off the List Price.", "Product Dimensions": "6.5 x 4.53 x 1.97 inches; 1.76 Ounces", "Type of item": "Video Game", "Item model number": "ZSXOF0", "Is Discontinued By Manufacturer": "No", "Item Weight": "1.76 ounces", "Manufacturer": "Extremerate", "Country of Origin": "China", "Date First Available": "September 11, 2017"}, "parent_asin": "B07H93H878", "bought_together": null}
{"main_category": "Computers", "title": "Set of 4 Bullet Buttons Nickel+Brass for Playstation PS3 PS2 controllers", "average_rating": 4.8, "rating_number": 4, "features": ["Case Color: Silver", "Case Material: Nickel", "Primer Color: Bronze", "Primer Material: Brass", "Brand of ammo shell may vary from the picture shown, but it will look very similar."], "description": ["A set of 4 bullet button to replace all the 4 buttons (Cross, Square, Triangle, Circle) of your PS3 controller. Made from 9mm luger bullet casings, to perfectly fit the PS3. The bullet buttons are already mounted on a special plastic holder that fit directly and perfectly your PS3 controller. Each cartridge shell have a small layer of clear coat that keep the bullet button ultra shiny even after many years. The clear coat also prevent the metallic odor that can be created during extended use . Case Color: Silver Case Material: Nickel Primer Color: Bronze Primer Material: Brass"], "price": null, "images": [{"thumb": "https://m.media-amazon.com/images/I/410NNQMoFEL._AC_US40_.jpg", "large": "https://m.media-amazon.com/images/I/410NNQMoFEL._AC_.jpg", "variant": "MAIN", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/41ETKho5m-L._AC_US40_.jpg", "large": "https://m.media-amazon.com/images/I/41ETKho5m-L._AC_.jpg", "variant": "PT01", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/415t4vhHyPL._AC_US40_.jpg", "large": "https://m.media-amazon.com/images/I/415t4vhHyPL._AC_.jpg", "variant": "PT02", "hi_res": "https://m.media-amazon.com/images/I/91sScW8kwwL._AC_SL1500_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/41EFsAYd8ZL._AC_US40_.jpg", "large": "https://m.media-amazon.com/images/I/41EFsAYd8ZL._AC_.jpg", "variant": "PT03", "hi_res": "https://m.media-amazon.com/images/I/812W6x6a9HL._AC_SL1500_.jpg"}], "videos": [], "store": "NEXiLUX", "categories": ["Video Games", "PlayStation 4", "Accessories", "Controllers"], "details": {"Package Dimensions": "4 x 3 x 0.3 inches", 
"Item Weight": "1 pounds", "Manufacturer": "NEXiLUX", "Date First Available": "September 17, 2012"}, "parent_asin": "B009C9E8JY", "bought_together": null}
{"main_category": "Video Games", "title": "Konami Collector's Series: Castlevania & Contra - PC", "average_rating": 3.6, "rating_number": 19, "features": ["This collection of classic 8-bit games includes five all-time greats - Castlevania, Castlevania II, Castlevania - Dracula's Curse, Contra and Super C", "Customize controls at any time, during any game", "Quick-save lets players save their progress from any point in the game", "Get back into the games you loved with this collection of old-school classics!"], "description": ["The Konami Collector's Series: Castlevania & Contra brings back the classic games that first appeared on the NES for a new generation of computer gamers!"], "price": null, "images": [{"thumb": "https://m.media-amazon.com/images/I/21CERVWQ5DL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/21CERVWQ5DL.jpg", "variant": "MAIN", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/31BDm2VqflL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/31BDm2VqflL.jpg", "variant": "GLMR", "hi_res": "https://m.media-amazon.com/images/I/51WzbguWjjL._SL1500_.jpg"}], "videos": [{"title": "Konami Collector's Series: Arcade Advanced", "url": "https://www.amazon.com/vdp/0dd0015b33fb49d4b3c1aa5c17ec497c?ref=dp_vse_rvc_0", "user_id": ""}], "store": "Square Enix", "categories": ["Video Games", "PC", "Games"], "details": {"Release date": "December 2, 2002", "Best Sellers Rank": {"Video Games": 154968, "PC-compatible Games": 8771}, "Pricing": "The strikethrough price is the List Price. Savings represents a discount off the List Price.", "Product Dimensions": "24 x 24 x 24 inches; 3.99 Ounces", "Type of item": "Video Game", "Rated": "Everyone", "Item model number": "KONAMI GAMES", "Is Discontinued By Manufacturer": "Yes", "Item Weight": "3.99 ounces", "Manufacturer": "Konami", "Date First Available": "December 13, 2002"}, "parent_asin": "B00006969T", "bought_together": null}

[40 points] Tasks on Spark DataFrame

NOTE:

Proof of implementation

A block like this lists the results of the task that you must present as evidence for grading. Producing them is part of the task.

Note: In this list of exercises, you must perform the analysis using the pyspark.sql.DataFrame API.

  1. Build a proper schema for both datasets and use it to read the dataset files. The schema should give each column the datatype mentioned in the description rather than storing everything as strings.

Proof of implementation

  1. Show the schema of the dataframes after creating them.
  2. Show the first 10 rows of each dataframe.
  1. You must take a 10% sample of the products dataframe and use it for analysis, with your SID as the seed of the pyspark.sql.DataFrame.sample function. The shared notebook contains the code snippet that takes the sample:
# df_products is the dataframe after successfully reading it from the file products.jsonl using the schema
# Rename some columns
df_products = df_products.withColumnsRenamed(
    {
      "title": "product_title",
      "images": "product_images"
    }
)


df_products_sample = df_products.sample(withReplacement=False, fraction = 0.1, seed = SID)

# Now you can delete the original dataframe
del df_products 
# df_reviews is the dataframe after successfully reading it from the file reviews.jsonl using the schema
# Rename some columns
df_reviews = df_reviews.withColumnsRenamed(
    {
      "title": "review_title",
      "text": "review_text",
      "timestamp": "review_timestamp",
      "images": "reviewed_product_images"
     }
)


cols = df_reviews.columns

# Keep only the reviews that belong to the sampled products
df_reviews_sample = df_reviews.join(df_products_sample, on='parent_asin',how='inner').select(cols)

# Now you can delete the original dataframe
del df_reviews


# Now you must use only these samples for all analysis tasks.
  1. We want to understand whether average ratings change by product category. Analyze the average ratings for products within each main category. Identify the categories with the best and the worst ratings.

Proof of implementation

  1. Explain in your own words, how you will implement this analysis task.
  2. Show the 10 product categories with the best ratings.
  3. Show the 10 product categories with the worst ratings.
  4. Add your recommendation/conclusion of this analysis result to the notebook as markdown text.
  1. Investigate whether there is a linear correlation between the price of a product and its average rating. This will help us observe the impact of price on ratings. Treat both variables as measured on a ratio scale.

Proof of implementation

  1. Explain in your own words, how you will implement this analysis task.
  2. Show the 10 most expensive products with the distribution of their average ratings.
  3. Show the 10 cheapest products with the distribution of their average ratings.
  4. Add your conclusion/recommendation of this analysis result to the notebook as markdown text.
  1. Here, we will analyze product release trends. You can get the product release dates from the details field. Explore the distribution of product release dates. Are there certain years or periods (seasons or months) with a higher number of releases? Are there any trends in the categories of products released over time?

Proof of implementation

  1. Explain in your own words, how you will implement this analysis task.
  2. Show the analysis results for the top 10 products and product categories
  3. Add your conclusion/recommendation of this analysis result to the notebook as markdown text.
  1. We want to find similar products based only on the description. To do that, extract some structured information (e.g. keywords or tags) from the description. You can use different methods to extract this information, including external packages, but any library you use must support working with Spark DataFrames; libraries that do not are not accepted. Define a Python function extract_info that returns the top structured info (a list or array of textual items) for a given product description (text).

Proof of implementation

  1. Explain in your own words, how you will implement this task.
  2. Show the top structured info for the first product along with its description.
  3. Add your conclusion to the notebook as markdown text.
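To make the expected shape of extract_info concrete, here is a deliberately naive sketch that returns the most frequent non-stop-word tokens. The frequency heuristic, the tiny stop-word list and the top_k parameter are all assumptions made for illustration; Spark-native tools such as Tokenizer and StopWordsRemover in pyspark.ml.feature are an alternative.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real one would be larger).
STOP_WORDS = {
    "a", "an", "and", "are", "for", "from", "in", "is", "of", "on",
    "that", "the", "this", "to", "with", "your",
}


def extract_info(description, top_k=5):
    """Return up to top_k most frequent non-stop-word tokens of a description."""
    if not description:
        return []
    tokens = re.findall(r"[a-z0-9]+", description.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS and len(t) > 2)
    # Sort by descending count, then alphabetically, so the output is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


keywords = extract_info(
    "Bullet buttons for the PS3 controller. The bullet buttons fit the PS3 controller.",
    top_k=3,
)
print(keywords)
```

The function takes plain text, so an array-typed description column would have to be joined into one string before calling it.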
  1. Use the previous function extract_info to define a UDF that extracts structured info from the description field for analysis. Use the UDF to identify similar products based on matching keywords extracted from the description field. How you aggregate the keywords extracted from each product description is up to you; for example, you can concatenate each product's keywords into a single string. You can use Levenshtein distance for string matching. Check here for the documentation. Is the extracted info useful for finding similar products? Specify the benefits and drawbacks of this approach.

Proof of implementation

  1. Explain in your own words, how you will implement this analysis task.
  2. Take a sample of 20 products of the dataframe using the function pyspark.sql.DataFrame.sample.
  3. Select the first product (first record) x in the selected sample.
  4. Find the top 20 similar products for x.
  5. Show the description and string matching scores for these 20 similar products.
  6. Add your conclusion of this analysis result and your answers to the notebook as markdown text.

[15 points] Tasks on Spark SQL

NOTE:

Proof of implementation

A block like this lists the results of the task that you must present as evidence for grading. Producing them is part of the task.

Note: In this list of exercises, you have to use the pyspark.sql API and perform the analysis using Spark SQL queries. If you need an operation that Spark SQL queries do not support, you can use the Spark DataFrame API for that operation only.

  1. We are interested in analyzing each store in the dataset. Find the distribution of products per store. The distribution of products here means the total number of products identified using asin.

Proof of implementation

  1. Explain in your own words, how you will implement this analysis task.
  2. Show the analysis results for the top 10 stores.
  3. Add your recommendation/conclusion of this analysis result to the notebook as markdown text.
  1. We want to identify top reviewers and their preferences. Find the reviewers who have written the most reviews and analyze the distribution of ratings in their reviews. Do they tend to rate higher or lower? Do they specialize in certain brands?

Proof of implementation

  1. Explain in your own words, how you will implement this analysis task.
  2. Show the analysis results for the top 10 reviewers
  3. Add your recommendation/conclusion of this analysis result to the notebook as markdown text.
  1. Investigate whether there is a relationship between the length of a review (review_text) and its rating. Are longer reviews more likely to have lower or higher ratings?

Proof of implementation

  1. Explain in your own words, how you will implement this analysis task.
  2. Show the analysis results for the top 10 reviews with longest text and their distribution of higher/lower ratings.
  3. Add your conclusion of this analysis result to the notebook as markdown text.

[45 points] Tasks on Spark ML library

Note: You have to use only PySpark APIs here. Using other packages that operate on a single machine or core is not allowed.

Note: While you train the models, please take a look at the Spark UI to track your Spark app.

  1. Write Python functions to generate non-personalized recommendations where you need to suggest/retrieve the following products for any user.
    • Top-10 rated products
    • Top-10 most recently published products
    • Top-10 most helpful reviews
    • Top-10 cheapest video games

Proof of implementation

  1. Run each function only once and print the output.
  1. Feature Encoding & Regression task: Here you need to predict the price of the product from other features.
    • You have to use products dataframe.
    • Preprocess the data and encode the non-numerical features.
      • For encoding long text fields, use TF-IDF or Word2Vec with embedding size not less than 20 (you can find them in pyspark.ml.feature). You can see some example here.
      • For encoding categorical fields, you can use One-hot-encoder or other proper encoders as you prefer.
      • For numerical fields like price, remove the missing values. price is the output column and you should not alter it. You take into consideration only the products whose price is known. For other numerical features, you encode as suggested.
      • For array/list or dict type fields, you can remove them if you want (for simplicity) OR you can create a row for each value in the array. You can use explode function for that purpose. See example here. Then you can encode with one-hot encoder or any other encoder.
      • For date/timestamp field, decompose it into three parts (day, month and year), drop the day part (and also the hour, minute and seconds parts) and keep only month and year parts, then build a custom pyspark.ml.Transformer for the month part whereas the year part can be treated as a numerical feature. Check this example to see how you can build a custom transformer. For month part of the date field, use the sin and cos transformation (one of the common ways to encode cyclical features). This transformation encodes each input into two components, one component on sin wave and the other one on cos wave. After encoding, you will get two columns for month as follows (month_sin, month_cos). It is given by
        $$\mathrm{month}_{\sin} = \sin\left(\frac{2\pi \cdot \mathrm{month}}{12}\right)$$
        $$\mathrm{month}_{\cos} = \cos\left(\frac{2\pi \cdot \mathrm{month}}{12}\right)$$
        where sin and cos are pyspark.sql.functions.sin and pyspark.sql.functions.cos, and π is math.pi.
      • Do not forget to scale the data if necessary.
        Note: You can use scalers in pyspark.ml.feature to scale the data or any additional methods to increase the performance of the system.
      • Split the dataset into 70% training and 30% test.
      • Build an estimator of your choice for predicting the price (use only train data).
      • Evaluate the performance of the estimator using rmse metric (use only test data).
      • Perform cross-validation using grid search with 3 folds to tune the hyperparameters. The grid must contain at least 2 hyperparameters, each with a list of at least 2 candidate values. At minimum, you should have 2×2=4 combinations, each evaluated with 3-fold cross-validation. The best model's performance should improve on the first model's performance.
      • Evaluate the performance of the best estimator using rmse metric (use only test data).
      • For one of the video games you select, predict the price using the optimized model.

Proof of implementation

  1. Show the first 10 samples of products dataframe before encoding.
  2. Show the first 10 samples of products dataframe after encoding. Use vertical=True and truncate=False.
  3. Show the size of the train and test dataframes (only number of rows).
  4. Show the value of rmse for the first model (before optimization).
  5. Show the value of the hyperparameters/settings before and after optimization.
  6. Show the value of rmse for the best model (after optimization).
  7. Show the predicted price of one game you select. Also show the original price of the game.
  8. Add your recommendation/conclusion of this analysis result to the notebook as markdown text.
  1. Matrix factorization based collaborative filtering: Here you need to implement a matrix factorization method and use it for generating recommendations.
    • Use ALS algorithm from pyspark.ml.recommendation to fit the rating data.
    • Split the data into 80% training and 20% test set. Do not forget to seed the random state.
    • Find top-5 personalized recommendations (recommended products) for the first 10 users.
    • Evaluate the performance of the system using meanAveragePrecision for the top-5 results. Use the test data.
    • Test 3 different settings for the ALS algorithm and add your observations as comments to your code.
    • Select the best model and calculate its performance on the test data.

Proof of implementation

  1. Show the size of the train and test dataframes (only number of rows).
  2. Show the first 10 samples of rating train data.
  3. Show the first 10 unique users from your dataset.
  4. Show top-5 recommended products for the first 10 users. You should show all records of the dataframe.
  5. Show the value of meanAveragePrecision for k=5.
  6. Show the 3 different settings for ALS algorithm.
  7. Show the value of meanAveragePrecision for each of the 3 different settings.
  8. Show the meanAveragePrecision of the best model.
  9. Add your conclusion of this analysis result to the notebook as markdown text.
THE END