Course: Big Data - IU S25
Author: Firas Jolha
In this assignment, you will work in Colab and must use the notebook template shared with you above. The template shows how to install PySpark and publish the Spark UI using localtunnel. Submit only the notebook, as an .ipynb file, to Moodle.
Important notes:
Note: In this assignment, you do not work on the full dataset files. You must take a 10% sample of the products dataframe and use it for analysis, with your SID as the seed for the random state of the pyspark.sql.DataFrame.sample function. The shared notebook contains the following code snippet for taking the sample:
# df_products is the dataframe after successfully reading it from the file products.jsonl using the schema
# Rename some columns
df_products = df_products.withColumnsRenamed(
    {
        "title": "product_title",
        "images": "product_images"
    }
)
# Take a reproducible 10% sample, using your SID as the seed
df_products_sample = df_products.sample(withReplacement=False, fraction=0.1, seed=SID)
# Now you can delete the original dataframe
del df_products
# df_reviews is the dataframe after successfully reading it from the file reviews.jsonl using the schema
# Rename some columns
df_reviews = df_reviews.withColumnsRenamed(
    {
        "title": "review_title",
        "text": "review_text",
        "timestamp": "review_timestamp",
        "images": "reviewed_product_images"
    }
)
cols = df_reviews.columns
# Keep only the reviews of the sampled products
df_reviews_sample = df_reviews.join(df_products_sample, on='parent_asin', how='inner').select(cols)
# Now you can delete the original dataframe
del df_reviews
# Now you must use only these samples for all analysis tasks.
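To see why a fixed seed makes the sample reproducible, and how the inner join restricts the reviews to the sampled products, the following plain-Python analogue may help. It is only a sketch and does not use Spark: the helper bernoulli_sample, the toy records, and the SID value are all made up for illustration, and Python's random module stands in for Spark's own sampler, so the rows it selects will differ from what DataFrame.sample returns.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction` (no replacement)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


SID = 12345  # hypothetical student ID, used only as the seed

# Toy stand-ins for the two dataframes: 1000 products with 5 reviews each.
products = [{"parent_asin": f"P{i:03d}"} for i in range(1000)]
reviews = [
    {"parent_asin": f"P{i % 1000:03d}", "rating": float(i % 5 + 1)}
    for i in range(5000)
]

# Same seed, same input, same sample; the size is only approximately 10%.
products_sample = bernoulli_sample(products, 0.1, SID)

# Analogue of the inner join on parent_asin: keep reviews of sampled products only.
sampled_ids = {p["parent_asin"] for p in products_sample}
reviews_sample = [r for r in reviews if r["parent_asin"] in sampled_ids]

print(len(products_sample), len(reviews_sample))
```

Because every row is kept or dropped independently, the sample size is close to, but not exactly, 10% of the input. The same is generally true of a fraction-based sample in Spark.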
This assignment is dedicated to practising with the Apache Spark SQL and machine learning libraries. For this task, you can use pyspark.sql, the legacy RDD-based MLlib in pyspark.mllib, and the DataFrame-based MLlib in pyspark.ml for Python. You will work on data analysis tasks using Spark SQL and on predictive analysis using different ML tasks, all on the same dataset. You also need to train a recommendation system that suggests products to Amazon users. You will work on a sample of an Amazon products dataset.
The dataset includes Amazon users who rated products and added reviews. Check the description of the datasets below.
This dataset consists of two jsonl files: reviews.jsonl contains users’ reviews of products on Amazon, and products.jsonl contains the product metadata. The fields in reviews.jsonl are described as follows:
Below is a sample of 10 records from the file reviews.jsonl:
{"rating": 4.0, "title": "It\u2019s pretty sexual. Not my fav", "text": "I\u2019m playing on ps5 and it\u2019s interesting. It\u2019s unique, massive, and has a neat story. People are freaking out angry about this game. I don\u2019t think it\u2019s a top 10 game but it\u2019s definitely a good game on ps5 (played at launch).", "images": [], "asin": "B07DJWBYKP", "parent_asin": "B07DK1H3H5", "user_id": "AGCI7FAH4GL5FI65HYLKWTMFZ2CQ", "timestamp": 1608186804795, "helpful_vote": 0, "verified_purchase": true}
{"rating": 5.0, "title": "Good. A bit slow", "text": "Nostalgic fun. A bit slow. I hope they don\u2019t stretch it out too far. It\u2019s good tho", "images": [], "asin": "B00ZS80PC2", "parent_asin": "B07SRWRH5D", "user_id": "AGCI7FAH4GL5FI65HYLKWTMFZ2CQ", "timestamp": 1587051114941, "helpful_vote": 1, "verified_purchase": false}
{"rating": 5.0, "title": "... an order for my kids & they have really enjoyed playing this PC game", "text": "This was an order for my kids & they have really enjoyed playing this PC game.", "images": [], "asin": "B01FEHJYUU", "parent_asin": "B07MFMFW34", "user_id": "AGXVBIUFLFGMVLATYXHJYL4A5Q7Q", "timestamp": 1490877431000, "helpful_vote": 0, "verified_purchase": true}
{"rating": 5.0, "title": "Great alt to pro controller", "text": "These work great, They use batteries which is a bummer, but for the 40 less that i paid its worth it. Batteries last a long time. Have been using to play rocket league on the switch with no issues", "images": [], "asin": "B07GXJHRVK", "parent_asin": "B0BCHWZX95", "user_id": "AFTC6ZR5IKNRDG5JCPVNVMU3XV2Q", "timestamp": 1577637634017, "helpful_vote": 0, "verified_purchase": true}
{"rating": 5.0, "title": "solid product", "text": "I would recommend to anyone looking to add just a little bit of height and a lot of grip to their thumb sticks. These will not create miracles, but it will give you better leverage for shooters.", "images": [], "asin": "B00HUWA45W", "parent_asin": "B00HUWA45W", "user_id": "AFTC6ZR5IKNRDG5JCPVNVMU3XV2Q", "timestamp": 1427591932000, "helpful_vote": 0, "verified_purchase": true}
{"rating": 3.0, "title": "love all the amazing colors but the black is really ...", "text": "love all the amazing colors but the black is really hard to see and I always have to have another form of a light on to use the key board now", "images": [], "asin": "B016Y2BVKA", "parent_asin": "B073SC6V1D", "user_id": "AHXSBZT52TCPZUBVCBRICTHWUCBA", "timestamp": 1518124539574, "helpful_vote": 0, "verified_purchase": true}
{"rating": 5.0, "title": "Will use again", "text": "Instant delivery!", "images": [], "asin": "B004RMK57U", "parent_asin": "B004RMK57U", "user_id": "AHZIJGKEWRTAEOZ673G5B3SNXEGQ", "timestamp": 1602937709361, "helpful_vote": 0, "verified_purchase": true}
{"rating": 5.0, "title": "if its prime day and you're contemplating, just stop -- do it.", "text": "you should probably get these. you'll journal about the next-level experience, or at least sit there and revel for a couple min. honestly, it's like you've never heard footsteps before. or windows systems settings. they recommend against using 7.1 surround sound on microsoft teams and skype -- but nobody said SMACK about zoom. these professors be twich streamin asmr evry time they take a drink of their 'make me talky' water. sometimes it blocks out my boyfriends nonsense caveman grumblins while hes playing rocket league and im on the sims, or i guess you can turn up the ambience and pretend you cant hear someone unironically ironically screaming POGGERS every 3 min. great deal on prime day btw. now, razer, or amazon, whoever comes first, just keep the discounts for the next year please so I can sell my last kidney to afford the rest of my pc build. thanks. cheers xx<br /><br />ps. half-joking about the selling my organs thing and poggers thing. kind of. hook it up for ur fav broke AF college graphic designer, would you pleaseee. 
:) <3", "images": [{"small_image_url": "https://images-na.ssl-images-amazon.com/images/I/41bzyynuwTL._SL256_.jpg", "medium_image_url": "https://images-na.ssl-images-amazon.com/images/I/41bzyynuwTL._SL800_.jpg", "large_image_url": "https://images-na.ssl-images-amazon.com/images/I/41bzyynuwTL._SL1600_.jpg", "attachment_type": "IMAGE"}, {"small_image_url": "https://images-na.ssl-images-amazon.com/images/I/7104ErJuizL._SL256_.jpg", "medium_image_url": "https://images-na.ssl-images-amazon.com/images/I/7104ErJuizL._SL800_.jpg", "large_image_url": "https://images-na.ssl-images-amazon.com/images/I/7104ErJuizL._SL1600_.jpg", "attachment_type": "IMAGE"}, {"small_image_url": "https://images-na.ssl-images-amazon.com/images/I/71IRNwL2DkL._SL256_.jpg", "medium_image_url": "https://images-na.ssl-images-amazon.com/images/I/71IRNwL2DkL._SL800_.jpg", "large_image_url": "https://images-na.ssl-images-amazon.com/images/I/71IRNwL2DkL._SL1600_.jpg", "attachment_type": "IMAGE"}], "asin": "B07N85FY1G", "parent_asin": "B0BYVN9ZK2", "user_id": "AFO6QN6ICKWUFQV3UEWK5EECIQTQ", "timestamp": 1602718512453, "helpful_vote": 0, "verified_purchase": false}
{"rating": 5.0, "title": "Price bumps it up from 4 stars", "text": "*it fits TWO wired Retro-bit 6 button controllers. Yes, the ones with the 8 ft cables.<br />Plus, a cut-out area for the console, and the factory hdmi and power supply fit it the lid pouch.<br /><br />warning: the stock 3 button controllers won't fit<br /><br />great simple case with good stamp-cut foam padding", "images": [], "asin": "B08L6782X9", "parent_asin": "B08L6782X9", "user_id": "AG6BAEKWLCWH2TW3KKLVK773YF6A", "timestamp": 1621448670253, "helpful_vote": 0, "verified_purchase": true}
{"rating": 1.0, "title": "It's an Auto-renew scam", "text": "Sony and Amazon are collaborating in an Auto-renew scam<br /><br />Buying this turns on Auto-renew allowing Sony to charge double the annual fee<br /><br />Sony is exploiting the financially challenged, Amazon gets kickbacks, and people who deserved to be burned alive are unscathed.", "images": [], "asin": "B017V6YVDC", "parent_asin": "B017V6YVDC", "user_id": "AG6BAEKWLCWH2TW3KKLVK773YF6A", "timestamp": 1607734474794, "helpful_vote": 2, "verified_purchase": true}
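As a quick way to check the field types before writing a Spark schema, one review record can be parsed with Python's standard json module. The sketch below uses one of the sample records above. It assumes that timestamp holds milliseconds since the Unix epoch, which is inferred from the 13-digit values and not stated in the assignment. The names field_types and reviewed_at are illustrative only.

```python
import json
from datetime import datetime, timezone

# One record copied from the reviews.jsonl sample.
line = (
    '{"rating": 5.0, "title": "Will use again", "text": "Instant delivery!", '
    '"images": [], "asin": "B004RMK57U", "parent_asin": "B004RMK57U", '
    '"user_id": "AHZIJGKEWRTAEOZ673G5B3SNXEGQ", "timestamp": 1602937709361, '
    '"helpful_vote": 0, "verified_purchase": true}'
)

record = json.loads(line)

# Map each field name to the Python type that the JSON value decodes to.
field_types = {name: type(value).__name__ for name, value in record.items()}
for name, type_name in field_types.items():
    print(f"{name}: {type_name}")

# Assumption: the timestamp is in milliseconds, so divide by 1000 to get seconds.
reviewed_at = datetime.fromtimestamp(record["timestamp"] / 1000, tz=timezone.utc)
print(reviewed_at.isoformat())
```

The decoded types suggest natural Spark counterparts: a floating-point rating, integer timestamp and helpful_vote, a boolean verified_purchase, and an array for images. The exact schema is still your own design choice.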
The fields in products.jsonl are described as follows:
Below is a sample of 10 records from the file products.jsonl:
{"main_category": "Video Games", "title": "Dash 8-300 Professional Add-On", "average_rating": 5.0, "rating_number": 1, "features": ["Features Dash 8-300 and 8-Q300 ('Q' rollout livery)", "Airlines - US Airways, South African Express, Bahamasair, Augsburg Airways, Lufthansa Cityline, British Airways (Union Jack), British European, FlyBe, Intersky, Wideroe, Iberia, Tyrolean, QantasLink, BWIA", "Airports include - London City, Frankfurt, Milan and Amsterdam Schipol", "Includes PSS PanelConfig and LoadEdit tools"], "description": ["The Dash 8-300 Professional Add-On lets you pilot a real commuter special. Fly two versions of the popular Dash 8-300 in a total of 17 different liveries. The Dash 8-300 is one of the most popular short-haul aircraft available and this superbly modelled version from acclaimed aircraft developers PSS is modelled in two versions with a total of 17 different liveries. The package also includes scenery for three European airports, tutorials, tutorial flights and utilities together in one fantastic package."], "price": null, "images": [{"thumb": "https://m.media-amazon.com/images/I/21DVWE41A0L._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/21DVWE41A0L.jpg", "variant": "MAIN", "hi_res": null}], "videos": [], "store": "Aerosoft", "categories": ["Video Games", "PC", "Games"], "details": {"Pricing": "The strikethrough price is the List Price. Savings represents a discount off the List Price.", "Package Dimensions": "7.5 x 5.5 x 0.6 inches; 4.8 Ounces", "Type of item": "CD-ROM", "Rated": "Everyone", "Item Weight": "4.8 ounces", "Manufacturer": "Aerosoft N.A. LTD", "Date First Available": "October 2, 2001"}, "parent_asin": "B000FH0MHO", "bought_together": null}
{"main_category": "Video Games", "title": "Phantasmagoria: A Puzzle of Flesh", "average_rating": 4.1, "rating_number": 18, "features": ["Windows 95"], "description": [], "price": null, "images": [{"thumb": "https://m.media-amazon.com/images/I/51pqAznTA9L._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/51pqAznTA9L.jpg", "variant": "MAIN", "hi_res": "https://m.media-amazon.com/images/I/71hD-k6kaxL._SL1101_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/61CCFhIg4qL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/61CCFhIg4qL.jpg", "variant": "PT01", "hi_res": "https://m.media-amazon.com/images/I/81dGuRrFwAL._SL1104_.jpg"}], "videos": [], "store": "Sierra", "categories": ["Video Games", "PC", "Games"], "details": {"Best Sellers Rank": {"Video Games": 137612, "PC-compatible Games": 6707}, "Pricing": "The strikethrough price is the List Price. Savings represents a discount off the List Price.", "Package Dimensions": "5.6 x 4.9 x 0.9 inches; 6.4 Ounces", "Type of item": "CD-ROM", "Rated": "Mature", "Is Discontinued By Manufacturer": "No", "Item Weight": "6.4 ounces", "Manufacturer": "Sierra", "Date First Available": "March 30, 2006"}, "parent_asin": "B00069EVOG", "bought_together": null}
{"main_category": "Video Games", "title": "NBA 2K17 - Early Tip Off Edition - PlayStation 4", "average_rating": 4.3, "rating_number": 223, "features": ["The #1 rated NBA video game simulation series for the last 15 years (Metacritic).", "The #1 selling NBA video game simulation series for the last 9 years (NPD).", "Over 85 awards and nominations since the launch of PlayStation 4 & Xbox One.", "BEST IN CLASS GAMEPLAY - 2K puts shot making in your hands like never before. Advanced Skill Shooting gives you complete control over the power and aim of your perimeter shots as well as your ability to finish inside the paint.", "THE PRELUDE - Begin your MyCAREER on one of 10 licensed collegiate programs, available for free download one week prior to launch!", "MyCAREER - It\u2019s all-new and all about basketball in 2K17 \u2013 and you\u2019re in control. Your on-court performance and career decisions lead to different outcomes as you determine your path through an immersive new narrative, featuring Michael B. Jordan. Additionally, new player controls give you unparalleled supremacy on the court.", "USA BASKETBALL - Take the court as Team USA with Coach K on the sidelines, or relive the glory of the \u201992 Dream Team. Earn USAB MyTEAM cards and gear up your MyPLAYER with official USAB wearables.", "COLLEGE INTEGRATION - For the first time, play as college basketball legends with each school\u2019s all-time greats team and MyTEAM cards.", "LEAGUE EXPANSION - For the first time, customize your MyLEAGUE and MyGM experience with league expansion. Choose your expansion team names, logos and uniforms, and share them with the rest of the NBA 2K community. Your customized league comes complete with everything from Expansion Drafts to modified schedules and more to ensure an authentic NBA experience.", "2K BEATS Imagine Dragons, Grimes, Noah \u201c40\u201d Shabib of OVO Sound and Michael B. 
Jordan curate another electric 2K soundtrack, featuring 50 songs."], "description": ["Following the record-breaking launch of NBA 2K16, the NBA 2K franchise continues to stake its claim as the most authentic sports video game with NBA 2K17. As the franchise that \u201call sports video games should aspire to be\u201d (GamesRadar), NBA 2K17 will take the game to new heights and continue to blur the lines between video game and reality."], "price": 58.0, "images": [{"thumb": "https://m.media-amazon.com/images/I/51wlIPcf0gL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/51wlIPcf0gL.jpg", "variant": "MAIN", "hi_res": "https://m.media-amazon.com/images/I/81MtBG4xXhL._SL1500_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/51smI92XGdL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/51smI92XGdL.jpg", "variant": "PT02", "hi_res": "https://m.media-amazon.com/images/I/81fby40wGQL._SL1500_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/41B2Li+r-6L._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/41B2Li+r-6L.jpg", "variant": "PT05", "hi_res": "https://m.media-amazon.com/images/I/71VFYALs8qL._SL1500_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/518ADy9h+wL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/518ADy9h+wL.jpg", "variant": "PT06", "hi_res": "https://m.media-amazon.com/images/I/71ZbTa0QT4L._SL1500_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/41EsOEFBg0L._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/41EsOEFBg0L.jpg", "variant": "PT07", "hi_res": "https://m.media-amazon.com/images/I/71kIQMjnwWL._SL1500_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/41JUeBKY2EL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/41JUeBKY2EL.jpg", "variant": "PT08", "hi_res": "https://m.media-amazon.com/images/I/71RyXZvZA2L._SL1500_.jpg"}, {"thumb": 
"https://m.media-amazon.com/images/I/41aPDXtqZxL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/41aPDXtqZxL.jpg", "variant": "PT09", "hi_res": "https://m.media-amazon.com/images/I/71PdzeowO3L._SL1500_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/411-B81va7L._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/411-B81va7L.jpg", "variant": "PT10", "hi_res": "https://m.media-amazon.com/images/I/81Anx48IQUL._SL1500_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/41g1KDskmML._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/41g1KDskmML.jpg", "variant": "PT11", "hi_res": "https://m.media-amazon.com/images/I/71mrJKrLwJL._SL1500_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/411DcBq41HL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/411DcBq41HL.jpg", "variant": "GLMR", "hi_res": "https://m.media-amazon.com/images/I/61CSnSRFIJL._SL1500_.jpg"}], "videos": [{"title": "NBA 2K17 - Kobe: Haters vs Players", "url": "https://www.amazon.com/vdp/386e44f88d0f41d99714076c93459753?ref=dp_vse_rvc_0", "user_id": ""}], "store": "2K", "categories": ["Video Games", "PlayStation 4", "Games"], "details": {"Release date": "September 16, 2016", "Best Sellers Rank": {"Video Games": 57637, "PlayStation 4 Games": 2886}, "Pricing": "The strikethrough price is the List Price. Savings represents a discount off the List Price.", "Product Dimensions": "0.4 x 5.3 x 6.6 inches; 1.6 Ounces", "Type of item": "Video Game", "Rated": "Everyone", "Item model number": "47793", "Is Discontinued By Manufacturer": "No", "Item Weight": "1.6 ounces", "Manufacturer": "2K Games", "Date First Available": "April 13, 2016"}, "parent_asin": "B00Z9TLVK0", "bought_together": null}
{"main_category": "Video Games", "title": "Nintendo Selects: The Legend of Zelda Ocarina of Time 3D (Renewed)", "average_rating": 4.9, "rating_number": 22, "features": ["Authentic Nintendo Selects: The Legend of Zelda Ocarina of Time 3D", "Does not come with original case or manuals. Cartridge only", "Cartridge and label are in nice condition", "Fully tested and guaranteed"], "description": [], "price": 37.42, "images": [{"thumb": "https://m.media-amazon.com/images/I/51raO0wAe8L._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/51raO0wAe8L.jpg", "variant": "MAIN", "hi_res": "https://m.media-amazon.com/images/I/81dM82yx6wL._SL1500_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/51ag4Lai25L._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/51ag4Lai25L.jpg", "variant": "PT01", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/41D7zacd5cL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/41D7zacd5cL.jpg", "variant": "PT02", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/41sNSMvZGAL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/41sNSMvZGAL.jpg", "variant": "PT03", "hi_res": null}], "videos": [], "store": "Amazon Renewed", "categories": ["Video Games", "Legacy Systems", "Nintendo Systems", "Nintendo 3DS & 2DS", "Games"], "details": {"Best Sellers Rank": {"Video Games": 51019, "Nintendo 3DS & 2DS Games": 432}, "Pricing": "The strikethrough price is the List Price. Savings represents a discount off the List Price.", "Product Dimensions": "0.5 x 5.4 x 4.9 inches; 2.05 Ounces", "Type of item": "Video Game", "Rated": "Everyone 10+", "Is Discontinued By Manufacturer": "No", "Item Weight": "2.04 ounces", "Manufacturer": "Nintendo", "Date First Available": "June 14, 2019"}, "parent_asin": "B07SZJZV88", "bought_together": null}
{"main_category": "Video Games", "title": "Thrustmaster Elite Fitness Pack for Nintendo Wii", "average_rating": 3.0, "rating_number": 3, "features": ["Includes (9) Total Accessories", "Pedometer", "Wii Fit Balance Board Stepper", "Floor Mat made from high density Woven Foam", "(2) flexible Arm/Leg Weights"], "description": ["The Thrustmaster Motion Plus Elite Fitness Pack for Wii is Ideal for Nintendo Wii Fit & Wii Fit Plus games such as EA Active (EA), U Shape & My Fitness Coach (UbiSoft). Ultimate pack with 9 accessories for the Nintendo Wii Fit Balance Board including: (1) floor mat made from woven foam Size (inches) : 70x20, (2) flexible ankle or wrist weights, (1) stepper for the Wii Balance Board, (1) pedometer to count your steps during each in-game exercise or train without games, (1) armband for Wiimote or Wiimote with MotionPlus - for use in menus without stepping away from the training area, (1) leg band for the Nunchuck controller (1) lanyard to your MP3 player around your neck and (1) carry bag featuring trendy sport design with two internal pockets: ideal for storing, protecting and transporting your Wii Balance Board with one game and ALL Elite Fitness pack accessories! 
2 Year Warranty!"], "price": null, "images": [{"thumb": "https://m.media-amazon.com/images/I/31spO9JKluL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/31spO9JKluL.jpg", "variant": "MAIN", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/31rX51CAv5L._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/31rX51CAv5L.jpg", "variant": "PT01", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/31rNKdcmUQL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/31rNKdcmUQL.jpg", "variant": "PT02", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/31xkaqpKS+L._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/31xkaqpKS+L.jpg", "variant": "PT03", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/314yku+VEIL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/314yku+VEIL.jpg", "variant": "PT04", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/41ezKm50QUL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/41ezKm50QUL.jpg", "variant": "PT05", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/31u0pX5P+BL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/31u0pX5P+BL.jpg", "variant": "PT06", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/31byOVYmfUL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/31byOVYmfUL.jpg", "variant": "PT07", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/3160QI-NIXL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/3160QI-NIXL.jpg", "variant": "PT08", "hi_res": null}], "videos": [], "store": "THRUSTMASTER", "categories": ["Video Games", "Legacy Systems", "Nintendo Systems", "Wii", "Accessories", "Fitness Accessories"], "details": {"Release date": "November 1, 2009", "Pricing": "The strikethrough price is 
the List Price. Savings represents a discount off the List Price.", "Product Dimensions": "19.8 x 14 x 4.75 inches; 7.35 Pounds", "Type of item": "Video Game", "Language": "English, French", "Item model number": "4660368", "Item Weight": "7.35 pounds", "Manufacturer": "Thrustmaster VG", "Date First Available": "September 11, 2009"}, "parent_asin": "B002WH4ZJG", "bought_together": null}
{"main_category": "Video Games", "title": "Grand Prix 4", "average_rating": 3.7, "rating_number": 18, "features": ["Operating System: Windows 98, ME, XP", "Developer: Geoff Crammond", "Publisher: Infogrames", "Genre: Formula 1"], "description": ["Grand Prix 4 is the fourth installment in the GP-Series by Geoff Crammond. However, this time around, Microprose had a much larger role in the development of the game. For the first time in the Grand-Prix-Series, Grand Prix 4 had to go directly head-to-head with an EA-Sports offering, in the form of F1 2002. The flame wars and heated debate about which is better has not stopped since. Grand Prix 4 is not a revolution over Grand Prix 3, but more an evolution, as was Grand Prix 3 over Grand Prix 2. Today the game still lives on with plenty of new downloads and updates available to bring the game up to the latest season."], "price": null, "images": [{"thumb": "https://m.media-amazon.com/images/I/51167J4AQSL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/51167J4AQSL.jpg", "variant": "MAIN", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/5128X6W23SL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/5128X6W23SL.jpg", "variant": "PT01", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/51XE41TZXDL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/51XE41TZXDL.jpg", "variant": "PT03", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/51PE3CA5B7L._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/51PE3CA5B7L.jpg", "variant": "PT04", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/31BDm2VqflL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/31BDm2VqflL.jpg", "variant": "GLMR", "hi_res": "https://m.media-amazon.com/images/I/51WzbguWjjL._SL1500_.jpg"}], "videos": [], "store": null, "categories": ["Video Games", "PC", "Games"], "details": 
{"Pricing": "The strikethrough price is the List Price. Savings represents a discount off the List Price.", "Package Dimensions": "7.64 x 5.43 x 1.42 inches; 7.04 Ounces", "Type of item": "Video Game", "Is Discontinued By Manufacturer": "No", "Item Weight": "7 ounces", "Date First Available": "November 28, 2011"}, "parent_asin": "B00005Y3OJ", "bought_together": null}
{"main_category": "Video Games", "title": "Spongebob Squarepants, Vol. 1", "average_rating": 4.4, "rating_number": 32, "features": ["Bubblestand: SpongeBob shows Patrick and Squidward his unique talent for blowing bubbles. Squidward attempts to surpass SpongeBobs expertise but not everything goes according to plan!", "Ripped Pants: When SpongeBob tries to impress Sandy Cheeks at Mussel Beach, he accidentally rips his pants. The beach crowd loves the unintentional joke until SpongeBob pushes it too far.", "Jellyfishing: While Squidward is recovering from an accident, SpongeBob and Patrick take him jellyfishing. But the two unwittingly thrust poor Squidward into the hazardous rigors of their favorite pastime.", "Plankton: When evil Plankton takes control of his brain, SpongeBob must fight his own body to prevent himself from revealing the top secret Krabby Patty recipe!"], "description": ["Now you can watch the wild underwater antics of SpongeBob SquarePants on your Game Boy Advance, with this collection of 4 great episodes. In \"Hall Monitor,\" an overzealous SpongeBob becomes the new Hall Monitor at Mrs. Puff's Boating School, and extends his jurisdiction to the unsuspecting citizens of Bikini Bottom. In \"Jellyfish Jam,\" SpongeBob takes a wild jellyfish home and discovers that they multiply quickly when they take over his house! In \"Jellyfishing,\" SpongeBob and Patrick unwittingly thrust poor Squidward, who is recovering from an accident, into the hazardous rigors of their favorite pastime, jellyfishing. 
In \"Plankton,\" evil Plankton takes control of SpongeBob's brain, and SpongeBob must fight his own body to prevent himself from revealing the top secret Krabby Patty recipe!"], "price": 33.98, "images": [{"thumb": "https://m.media-amazon.com/images/I/611XSBQ4RBL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/611XSBQ4RBL.jpg", "variant": "MAIN", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/31BDm2VqflL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/31BDm2VqflL.jpg", "variant": "GLMR", "hi_res": "https://m.media-amazon.com/images/I/51WzbguWjjL._SL1500_.jpg"}], "videos": [], "store": "Majesco", "categories": ["Video Games", "Legacy Systems", "Nintendo Systems", "Game Boy Systems", "Game Boy Advance", "Games"], "details": {"Release date": "August 15, 2004", "Best Sellers Rank": {"Video Games": 47190, "Game Boy Advance Games": 142}, "Pricing": "The strikethrough price is the List Price. Savings represents a discount off the List Price.", "Product Dimensions": "5.75 x 4 x 1 inches; 0.32 Ounces", "Type of item": "Video Game", "Rated": "Everyone", "Item model number": "GAMJC 096427013426", "Is Discontinued By Manufacturer": "Yes", "Item Weight": "0.32 ounces", "Manufacturer": "Majesco Sales Inc.", "Date First Available": "June 1, 2004"}, "parent_asin": "B0001ZNU56", "bought_together": null}
{"main_category": "Video Games", "title": "eXtremeRate Soft Touch Top Shell Front Housing Faceplate Replacement Parts with Side Rails Panel for Xbox One X & One S Controller (Shadow Purple)", "average_rating": 4.5, "rating_number": 3061, "features": ["Compatibility Models: Ultra fits for Xbox One X & One S controller ; Not compatible with Xbox One Elite controller & Standard Xbox One Controller.Check the second picture of the listing before purchase", "Fit Perfectly: Fit the best by far; Completely fits flush on all side; Sit properly on all the clips", "Package Includes: 1 * Faceplate shell ; 1 * Right side rails; 1 * Left side rails;1*Open Shell Tool;1 * T8H screwdriver; 7* screws. (IMPORTANT: The controller and other parts are not included.)", "Installation Skills Required: Required customers to take apart the controller to install this front housing shell; Required customers handy with controller modifications", "Personalized Feature: The shadow purple color looks great; Great smooth grip, soft in hand and feels silky; Anti slip, sweat free for a long period game playing"], "description": [], "price": 17.59, "images": [{"thumb": "https://m.media-amazon.com/images/I/41reviTraVL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/41reviTraVL.jpg", "variant": "MAIN", "hi_res": "https://m.media-amazon.com/images/I/711FCClp7qL._SL1500_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/51Y2sNaPYuL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/51Y2sNaPYuL.jpg", "variant": "PT01", "hi_res": "https://m.media-amazon.com/images/I/61WZB0JSbfL._SL1000_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/41yHhIt5rDL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/41yHhIt5rDL.jpg", "variant": "PT02", "hi_res": "https://m.media-amazon.com/images/I/517Xy-auRHL._SL1000_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/41SlaNfN4fL._SX38_SY50_CR,0,0,38,50_.jpg", "large": 
"https://m.media-amazon.com/images/I/41SlaNfN4fL.jpg", "variant": "PT03", "hi_res": "https://m.media-amazon.com/images/I/61oqzfBYbVL._SL1000_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/41+ZJYYRKVL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/41+ZJYYRKVL.jpg", "variant": "PT04", "hi_res": "https://m.media-amazon.com/images/I/61gRCbvxyBL._SL1000_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/41DKBJyb-EL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/41DKBJyb-EL.jpg", "variant": "PT05", "hi_res": "https://m.media-amazon.com/images/I/61oHubD2jhL._SL1000_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/418wNkXtQ2L._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/418wNkXtQ2L.jpg", "variant": "PT06", "hi_res": "https://m.media-amazon.com/images/I/61fUe2N9f8L._SL1000_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/411+i-oIjIL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/411+i-oIjIL.jpg", "variant": "PT07", "hi_res": "https://m.media-amazon.com/images/I/516v2K0Aq9L._SL1000_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/41dEWqxkoWL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/41dEWqxkoWL.jpg", "variant": "PT08", "hi_res": "https://m.media-amazon.com/images/I/51F+pvO+NfL._SL1000_.jpg"}], "videos": [], "store": "eXtremeRate", "categories": ["Video Games", "Xbox One", "Accessories", "Faceplates, Protectors & Skins"], "details": {"Best Sellers Rank": {"Video Games": 48130, "Xbox One Faceplates, Protectors & Skins": 253}, "Pricing": "The strikethrough price is the List Price. 
Savings represents a discount off the List Price.", "Product Dimensions": "6.5 x 4.53 x 1.97 inches; 1.76 Ounces", "Type of item": "Video Game", "Item model number": "ZSXOF0", "Is Discontinued By Manufacturer": "No", "Item Weight": "1.76 ounces", "Manufacturer": "Extremerate", "Country of Origin": "China", "Date First Available": "September 11, 2017"}, "parent_asin": "B07H93H878", "bought_together": null}
{"main_category": "Computers", "title": "Set of 4 Bullet Buttons Nickel+Brass for Playstation PS3 PS2 controllers", "average_rating": 4.8, "rating_number": 4, "features": ["Case Color: Silver", "Case Material: Nickel", "Primer Color: Bronze", "Primer Material: Brass", "Brand of ammo shell may vary from the picture shown, but it will look very similar."], "description": ["A set of 4 bullet button to replace all the 4 buttons (Cross, Square, Triangle, Circle) of your PS3 controller. Made from 9mm luger bullet casings, to perfectly fit the PS3. The bullet buttons are already mounted on a special plastic holder that fit directly and perfectly your PS3 controller. Each cartridge shell have a small layer of clear coat that keep the bullet button ultra shiny even after many years. The clear coat also prevent the metallic odor that can be created during extended use . Case Color: Silver Case Material: Nickel Primer Color: Bronze Primer Material: Brass"], "price": null, "images": [{"thumb": "https://m.media-amazon.com/images/I/410NNQMoFEL._AC_US40_.jpg", "large": "https://m.media-amazon.com/images/I/410NNQMoFEL._AC_.jpg", "variant": "MAIN", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/41ETKho5m-L._AC_US40_.jpg", "large": "https://m.media-amazon.com/images/I/41ETKho5m-L._AC_.jpg", "variant": "PT01", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/415t4vhHyPL._AC_US40_.jpg", "large": "https://m.media-amazon.com/images/I/415t4vhHyPL._AC_.jpg", "variant": "PT02", "hi_res": "https://m.media-amazon.com/images/I/91sScW8kwwL._AC_SL1500_.jpg"}, {"thumb": "https://m.media-amazon.com/images/I/41EFsAYd8ZL._AC_US40_.jpg", "large": "https://m.media-amazon.com/images/I/41EFsAYd8ZL._AC_.jpg", "variant": "PT03", "hi_res": "https://m.media-amazon.com/images/I/812W6x6a9HL._AC_SL1500_.jpg"}], "videos": [], "store": "NEXiLUX", "categories": ["Video Games", "PlayStation 4", "Accessories", "Controllers"], "details": {"Package Dimensions": "4 x 3 x 0.3 inches", 
"Item Weight": "1 pounds", "Manufacturer": "NEXiLUX", "Date First Available": "September 17, 2012"}, "parent_asin": "B009C9E8JY", "bought_together": null}
{"main_category": "Video Games", "title": "Konami Collector's Series: Castlevania & Contra - PC", "average_rating": 3.6, "rating_number": 19, "features": ["This collection of classic 8-bit games includes five all-time greats - Castlevania, Castlevania II, Castlevania - Dracula's Curse, Contra and Super C", "Customize controls at any time, during any game", "Quick-save lets players save their progress from any point in the game", "Get back into the games you loved with this collection of old-school classics!"], "description": ["The Konami Collector's Series: Castlevania & Contra brings back the classic games that first appeared on the NES for a new generation of computer gamers!"], "price": null, "images": [{"thumb": "https://m.media-amazon.com/images/I/21CERVWQ5DL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/21CERVWQ5DL.jpg", "variant": "MAIN", "hi_res": null}, {"thumb": "https://m.media-amazon.com/images/I/31BDm2VqflL._SX38_SY50_CR,0,0,38,50_.jpg", "large": "https://m.media-amazon.com/images/I/31BDm2VqflL.jpg", "variant": "GLMR", "hi_res": "https://m.media-amazon.com/images/I/51WzbguWjjL._SL1500_.jpg"}], "videos": [{"title": "Konami Collector's Series: Arcade Advanced", "url": "https://www.amazon.com/vdp/0dd0015b33fb49d4b3c1aa5c17ec497c?ref=dp_vse_rvc_0", "user_id": ""}], "store": "Square Enix", "categories": ["Video Games", "PC", "Games"], "details": {"Release date": "December 2, 2002", "Best Sellers Rank": {"Video Games": 154968, "PC-compatible Games": 8771}, "Pricing": "The strikethrough price is the List Price. Savings represents a discount off the List Price.", "Product Dimensions": "24 x 24 x 24 inches; 3.99 Ounces", "Type of item": "Video Game", "Rated": "Everyone", "Item model number": "KONAMI GAMES", "Is Discontinued By Manufacturer": "Yes", "Item Weight": "3.99 ounces", "Manufacturer": "Konami", "Date First Available": "December 13, 2002"}, "parent_asin": "B00006969T", "bought_together": null}
NOTE: Each block titled "Proof of implementation" outlines the results of a task that must be presented as evidence for grading purposes. Producing them is part of the task.
Note: In this list of exercises, you must perform the analysis using the Spark DataFrame API (pyspark.sql.DataFrame).
Proof of implementation
- Show the schema of the dataframes after creating them.
- Show the first 10 rows of each dataframe.
Take a 10% sample of the products dataframe, using your SID as the seed for the random state of the pyspark.sql.DataFrame.sample function. The shared notebook contains the code snippets to take the sample as follows:
# df_products is the dataframe after successfully reading it from the file products.jsonl using the schema
# Rename some columns
df_products = df_products.withColumnsRenamed(
{
"title": "product_title",
"images": "product_images"
}
)
df_products_sample = df_products.sample(withReplacement=False, fraction = 0.1, seed = SID)
# Now you can delete the original dataframe
del df_products
# df_reviews is the dataframe after successfully reading it from the file reviews.jsonl using the schema
# Rename some columns
df_reviews = df_reviews.withColumnsRenamed(
{
"title": "review_title",
"text": "review_text",
"timestamp": "review_timestamp",
"images": "reviewed_product_images"
}
)
cols = df_reviews.columns
# Keep only the reviews of the sampled products
df_reviews_sample = df_reviews.join(df_products_sample, on='parent_asin',how='inner').select(cols)
# Now you can delete the original dataframe
del df_reviews
# Now you must use only these samples for all analysis tasks.
Proof of implementation
- Explain in your own words, how you will implement this analysis task.
- Show the top 10 product categories with the best ratings
- Show the top 10 product categories with the worst ratings
- Add your recommendation/conclusion of this analysis result to the notebook as markdown text.
Proof of implementation
- Explain in your own words, how you will implement this analysis task.
- Show the 10 most expensive products with the distribution of their average ratings.
- Show the 10 cheapest products with the distribution of their average ratings.
- Add your conclusion/recommendation of this analysis result to the notebook as markdown text.
details. Explore the distribution of product release dates. Are there certain years or periods (seasons or months) with a higher number of releases? Are there any trends in the categories of products released over time?
Proof of implementation
- Explain in your own words, how you will implement this analysis task.
- Show the analysis results for the top 10 products and product categories
- Add your conclusion/recommendation of this analysis result to the notebook as markdown text.
extract_info which will return the top structured info (list or array of textual info) for a specific product description (text).
Proof of implementation
- Explain in your own words, how you will implement this task.
- Show the top structured info for the first product along with its description.
- Add your conclusion to the notebook as markdown text.
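As a rough illustration of what a function like extract_info might look like, here is a minimal sketch in plain Python. It assumes that "structured info" can be approximated by the most frequent keywords of the description, which is only one possible reading of the task. The stop-word list and the parameter `top_n` are hypothetical choices made for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stop-word list (an assumption for this sketch).
STOP_WORDS = {
    "a", "an", "and", "the", "of", "to", "for", "in", "on", "with", "is",
    "are", "that", "this", "it", "your", "you", "from", "can", "be", "by",
    "at", "as", "or",
}


def extract_info(description, top_n=5):
    """Return up to top_n most frequent keywords of a product description."""
    if not description:
        return []
    # Lowercase the text and keep alphanumeric tokens only.
    tokens = re.findall(r"[a-z0-9]+", description.lower())
    # Drop stop words and very short tokens.
    keywords = [t for t in tokens if t not in STOP_WORDS and len(t) > 2]
    counts = Counter(keywords)
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]


if __name__ == "__main__":
    text = "The bullet buttons fit the PS3 controller. Bullet buttons stay shiny."
    print(extract_info(text, top_n=3))
```

Because the function is ordinary Python and returns a list of strings, it could later be wrapped with pyspark.sql.functions.udf to run over the description column.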
extract_info and define a UDF for extracting structured info from the field description that can be used for analysis. Use the UDF to identify similar products based on matching keywords extracted from the description field. How you aggregate the keywords extracted from each product description is up to you. For example, you can concatenate them to form a single string for each list of keywords extracted per product description. You can use Levenshtein distance for string matching. Check here for the documentation. Is the extracted info useful to find similar products? Specify the benefits and drawbacks of using this approach.
Proof of implementation
- Explain in your own words, how you will implement this analysis task.
- Take a sample of 20 products of the dataframe using the function pyspark.sql.DataFrame.sample.
- Select the first product (first record) in the selected sample.
- Find the top 20 similar products for the selected product.
- Show the description and string matching scores for these 20 similar products.
- Add your conclusion of this analysis result and your answers to the notebook as markdown text.
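To make the string matching score concrete, the sketch below computes the Levenshtein distance between concatenated keyword strings in plain Python and turns it into a similarity between 0 and 1. The normalisation by the longer string length and the helper names `similarity` and `rank_similar` are assumptions made for illustration. In Spark itself, the built-in pyspark.sql.functions.levenshtein computes the same distance on two string columns.

```python
def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming, two rows)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Hypothetical normalised score: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def rank_similar(target, candidates, top_n=20):
    """Return (score, candidate) pairs, most similar first."""
    scored = [(similarity(target, c), c) for c in candidates]
    return sorted(scored, key=lambda s: (-s[0], s[1]))[:top_n]


if __name__ == "__main__":
    target = "bullet buttons ps3 controller"
    others = ["bullet button ps3 controller", "castlevania contra classic games"]
    for score, text in rank_similar(target, others):
        print(round(score, 3), text)
```

A character-level distance rewards near-identical keyword strings but is sensitive to keyword order, which is worth keeping in mind when discussing the drawbacks of this approach.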
Note: In this list of exercises, you have to use the pyspark.sql API and perform the analysis using Spark SQL queries. If you need an operation that Spark SQL queries do not support, you may use the Spark DataFrame API for that operation only.
asin.
Proof of implementation
- Explain in your own words, how you will implement this analysis task.
- Show the analysis results for the top 10 stores.
- Add your recommendation/conclusion of this analysis result to the notebook as markdown text.
Proof of implementation
- Explain in your own words, how you will implement this analysis task.
- Show the analysis results for the top 10 reviewers
- Add your recommendation/conclusion of this analysis result to the notebook as markdown text.
reviewText) and its rating. Are longer reviews more likely to have lower or higher ratings?
Proof of implementation
- Explain in your own words, how you will implement this analysis task.
- Show the analysis results for the 10 reviews with the longest text and their distribution of higher/lower ratings.
- Add your conclusion of this analysis result to the notebook as markdown text.
Note: Here you must use only PySpark APIs. Packages that operate on a single machine or core are not allowed.
Note: While you train the models, please take a look at the Spark UI to track your Spark app.
Proof of implementation
- Run each function only once and print the output.
- products dataframe.
- TF-IDF or Word2Vec with embedding size not less than 20 (you can find them in pyspark.ml.feature). You can see some example here.
- One-hot-encoder or other proper encoders as you prefer.
- price, remove the missing values. price is the output column and you should not alter it. Take into consideration only the products whose price is known. For other numerical features, encode as suggested.
- explode function for that purpose. See example here. Then you can encode with one-hot encoder or any other encoder.
- pyspark.ml.Transformer for the month part, whereas the year part can be treated as a numerical feature. Check this example to see how you can build a custom transformer. For the month part of the date field, use the sin and cos transformation (one of the common ways to encode cyclical features). This transformation encodes each input into two components, one on the sin wave and the other on the cos wave. After encoding, you will get two columns for month as follows $(m_{\sin}, m_{\cos})$. It is given by $m_{\sin} = \sin(2\pi m / 12)$ and $m_{\cos} = \cos(2\pi m / 12)$, where $m$ is the month number.
- pyspark.ml.feature to scale the data or any additional methods to increase the performance of the system.
- rmse metric (use only test data).
- rmse metric (use only test data).
Proof of implementation
- Show the first 10 samples of the products dataframe before encoding.
- Show the first 10 samples of the products dataframe after encoding. Use vertical=True and truncate=False.
- Show the size of the train and test dataframes (only number of rows).
- Show the value of rmse for the first model (before optimization).
- Show the value of the hyperparameters/settings before and after optimization.
- Show the value of rmse for the first model (after optimization).
- Show the predicted price of one game you select. Also show the original price of the game.
- Add your recommendation/conclusion of this analysis result to the notebook as markdown text.
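The sin and cos transformation of the month mentioned in the encoding steps above can be illustrated with a few lines of plain Python. This is only a sketch of the arithmetic, assuming a period of 12 and months numbered 1 to 12. The function name `encode_month` is hypothetical. Inside a custom pyspark.ml.Transformer the same expressions could be written with pyspark.sql.functions.sin and pyspark.sql.functions.cos in the `_transform` method.

```python
import math


def encode_month(month, period=12):
    """Map a month number to a point (sin, cos) on the unit circle."""
    angle = 2.0 * math.pi * month / period
    return math.sin(angle), math.cos(angle)


def distance(p, q):
    """Euclidean distance between two encoded points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


if __name__ == "__main__":
    for m in (1, 6, 12):
        s, c = encode_month(m)
        print(m, round(s, 3), round(c, 3))
    # December and January end up as close as January and February,
    # which a plain 1..12 numeric encoding would not capture.
    print(distance(encode_month(12), encode_month(1)))
    print(distance(encode_month(1), encode_month(2)))
```

Using two components is what removes the artificial jump between month 12 and month 1. A single sin column alone would map different months to the same value.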
- pyspark.ml.recommendation to fit the rating data.
- meanAveragePrecision for the top-5 results. Use the test data.
Proof of implementation
- Show the size of the train and test dataframes (only number of rows).
- Show the first 10 samples of the rating train data.
- Show the first 10 unique users from your dataset.
- Show the top-5 recommended products for the first 10 users. You should show all records of the dataframe.
- Show the value of meanAveragePrecision.
- Show the 3 different settings for the ALS algorithm.
- Show the value of meanAveragePrecision for each of the 3 different settings.
- Show the performance (meanAveragePrecision) of the best model.
- Add your conclusion of this analysis result to the notebook as markdown text.
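To clarify what a mean average precision over the top-5 results measures, here is a small reference computation in plain Python. It is a sketch under stated assumptions and not a replacement for the Spark evaluators required by the assignment. The function names are hypothetical, recommended items are assumed to be unique per user, and the normalisation by min(number of relevant items, k) is one common convention that may differ from what a given library reports.

```python
def average_precision_at_k(recommended, relevant, k=5):
    """Average precision of one ranked list against a set of relevant items."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    # Assumed convention: normalise by the best achievable number of hits.
    return precision_sum / min(len(relevant), k)


def mean_average_precision_at_k(recs_by_user, truth_by_user, k=5):
    """Mean of the per-user average precision over all users with ground truth."""
    users = list(truth_by_user)
    if not users:
        return 0.0
    total = sum(
        average_precision_at_k(recs_by_user.get(u, []), truth_by_user[u], k)
        for u in users
    )
    return total / len(users)


if __name__ == "__main__":
    recs = {"u1": ["a", "b", "c", "d", "e"], "u2": ["x", "y"]}
    truth = {"u1": {"a", "c"}, "u2": {"a"}}
    print(mean_average_precision_at_k(recs, truth, k=5))
```

In this toy example the first user has relevant items at ranks 1 and 3 and the second user has none in the list, so the mean is pulled down by the user with no hits.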