Yelp Data Analysis
- Date: Oct 2017
- Category: Data Science
- Key Tags: Big Data, Hadoop, Spark
A project that uses the Yelp Dataset for basic big data analysis.
Extract around 490,000 records related to restaurants and users from the Yelp Dataset. Use Hadoop MapReduce to derive statistics from the dataset, such as the top 10 restaurants by average rating in a specific area. Implement the same analysis in Spark by running a shell script on the same dataset to validate the results and compare the pros and cons of the two techniques.
Q: List the business_id, full address and categories of the top 10 businesses by average rating.
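The logic behind this question can be sketched in plain Python before moving it to Hadoop or Spark. This is a minimal illustration, not the project's actual job: it assumes rows are separated by "::", that review rows look like review_id::user_id::business_id::stars, and that business rows look like business_id::full_address::categories. The names SEP, average_ratings and top_businesses are hypothetical.

```python
from collections import defaultdict

SEP = "::"  # assumed field separator; adjust to the real file layout


def average_ratings(review_lines):
    """Group stars by business_id (the map step) and average them (the reduce step)."""
    totals = defaultdict(lambda: [0.0, 0])  # business_id -> [sum of stars, review count]
    for line in review_lines:
        fields = line.strip().split(SEP)
        if len(fields) < 4:
            continue  # skip malformed rows
        business_id, stars = fields[2], float(fields[3])
        totals[business_id][0] += stars
        totals[business_id][1] += 1
    return {b: s / c for b, (s, c) in totals.items()}


def top_businesses(review_lines, business_lines, n=10):
    """Join the averages with business details and keep the n best-rated businesses."""
    averages = average_ratings(review_lines)
    details = {}
    for line in business_lines:
        fields = line.strip().split(SEP)
        if len(fields) >= 3:
            details[fields[0]] = (fields[1], fields[2])
    # Sort by average rating (highest first), then by business_id for stable ties.
    ranked = sorted(
        (b for b in averages if b in details),
        key=lambda b: (-averages[b], b),
    )
    return [(b, details[b][0], details[b][1], averages[b]) for b in ranked[:n]]
```

In a MapReduce job, the grouping dictionary corresponds to the shuffle between mapper and reducer, and the join with business details would typically be a second pass or a map-side join.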
Q: List the 'user id' and 'rating' of users who reviewed businesses located in “Palo Alto”.
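This question is a filter followed by a join, which can be sketched in plain Python as follows. The sketch is illustrative only: it assumes a "::" separator, business rows shaped like business_id::full_address::categories, review rows shaped like review_id::user_id::business_id::stars, and that the city name appears inside the full address. The names SEP and reviews_in_city are hypothetical.

```python
SEP = "::"  # assumed field separator; adjust to the real file layout


def reviews_in_city(review_lines, business_lines, city="Palo Alto"):
    """Return (user_id, stars) for every review of a business whose address mentions city."""
    # Step 1: collect the ids of businesses located in the target city.
    local_ids = set()
    for line in business_lines:
        fields = line.strip().split(SEP)
        if len(fields) >= 2 and city in fields[1]:
            local_ids.add(fields[0])

    # Step 2: keep only the reviews that point at one of those businesses.
    results = []
    for line in review_lines:
        fields = line.strip().split(SEP)
        if len(fields) >= 4 and fields[2] in local_ids:
            results.append((fields[1], float(fields[3])))
    return results
```

Holding the filtered business ids in a set mirrors a map-side (broadcast) join, which works when the filtered business list is small enough to fit in memory.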
Load the input data files into HDFS, for example:
hdfs dfs -put <business.csv>
e.g.: hdfs dfs -put ~/Documents/input_files/business.csv /parallels/input
hdfs dfs -put <review.csv>
e.g.: hdfs dfs -put ~/Documents/input_files/review.csv /parallels/input
hdfs dfs -put <soc-LiveJournal1Adj.txt>
e.g.: hdfs dfs -put ~/Documents/input_files/soc-LiveJournal1Adj.txt /parallels/input
hdfs dfs -put <user.csv>
e.g.: hdfs dfs -put ~/Documents/input_files/user.csv /parallels/input
Put the source code files in PyCharm together with the data files, then click Run. All results are written to the output file.