Big Data Assignment Help: Programming for Big Data

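This assignment requires using the Hadoop Streaming API to process big data. Hadoop is a big data processing framework whose source code is written in Java, but through Streaming, MapReduce jobs can also be written in other programming languages. The assignment has five sub-questions, some of which need to be split into multiple MapReduce jobs. The data comes from the Yelp Dataset Challenge.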
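As a rough illustration of how a Streaming job can be written in Python, the sketch below counts reviews per city and category. It is a minimal sketch under stated assumptions, not a reference solution: it assumes each input line is one JSON business record with `city`, `categories` and `review_count` fields (the exact field names and the type of `categories` depend on the dataset release), and the function names `map_business` and `reduce_sum` are made up for this example. Filtering to US cities is left out.

```python
import json
import sys
from itertools import groupby


def map_business(lines):
    """Yield 'city|category<TAB>review_count' for each business record."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            biz = json.loads(line)
        except ValueError:
            continue  # skip malformed records
        if not isinstance(biz, dict) or not biz.get("city"):
            continue
        # Assumption: 'categories' is either a list or a comma-separated string.
        cats = biz.get("categories") or []
        if isinstance(cats, str):
            cats = [c.strip() for c in cats.split(",") if c.strip()]
        count = biz.get("review_count", 0)  # assumed field name
        for cat in cats:
            yield f"{biz['city']}|{cat}\t{count}"


def reduce_sum(lines):
    """Sum the values per key. Input must arrive sorted by key,
    which is what Hadoop's shuffle phase provides to a reducer."""
    pairs = (l.rstrip("\n").split("\t", 1) for l in lines if "\t" in l)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(v) for _, v in group)}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Run as 'script.py map' or 'script.py reduce', reading from stdin.
    step = map_business if sys.argv[1] == "map" else reduce_sum
    for out in step(sys.stdin):
        print(out)
```

With Hadoop Streaming, the two modes would typically be supplied through the `-mapper` and `-reducer` options. The same logic can be checked locally by piping the mapper output through `sort` into the reducer.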

You must use Hadoop (MapReduce in Java or Python, or Pig, with Spark as extra credit) to analyze the data from the Yelp Dataset Challenge: https://www.yelp.com/dataset_challenge.
Specifically, you must provide the answers (and code) to the following 5 questions:
1. Summarize the number of reviews by US city, by business category.
2. Rank all cities by number of stars, descending, for each category.
3. What is the average rank (number of stars) for businesses within 20 miles of the University of Wisconsin - Madison, by type of business?
Center: University of Wisconsin - Madison. Latitude: 43° 04' 30" N, Longitude: 89° 25' 2" W. Decimal degrees: Latitude: 43.0766, Longitude: -89.4125.
The bounding box for this problem is ~20 miles, which we will loosely define as 20 minutes of arc. So the bounding box is a square, 40 minutes long on each side (of longitude and latitude), with UWM at the center.
4. Rank reviewers by number of reviews. For the top 10 reviewers, show their average number of stars, by category.
5. For the top 10 and bottom 10 food businesses near UWM (in terms of stars), summarize the star ratings of reviews written in June through December.
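The bounding box used in questions 3 and 5 can be sketched as a simple filter. This is only an illustration under assumptions: 20 minutes of arc is taken as 20/60 of a degree on each side of the center, the decimal-degree center given above is used, and the function name `in_uwm_box` is hypothetical.

```python
# Center of the box, using the decimal degrees quoted in the assignment.
UWM_LAT, UWM_LON = 43.0766, -89.4125

# 20 minutes of arc on each side of the center, so the box is 40 minutes wide.
HALF_SIDE_DEG = 20 / 60.0


def in_uwm_box(lat, lon):
    """Return True if (lat, lon) lies inside the square box around UWM."""
    if lat is None or lon is None:
        return False  # records without coordinates are treated as outside
    return (abs(lat - UWM_LAT) <= HALF_SIDE_DEG
            and abs(lon - UWM_LON) <= HALF_SIDE_DEG)


if __name__ == "__main__":
    print(in_uwm_box(43.0766, -89.4125))  # the center itself
    print(in_uwm_box(41.88, -87.63))      # a point roughly at Chicago
```

A mapper for question 3 could call such a check on each business record before emitting its category and star rating, so that only nearby businesses reach the reducer.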
