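Find the trending topic in a large collection of tweets. After reading in the file, remove stop words, compute word counts, and perform related operations; a series of helper functions must also be implemented.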

Question 1: Twitter

For this question, you will write two classes to create a new data type that models information extracted from Twitter. Your code for this question will go in two .java files.

We strongly recommend that you complete all the warm-up questions before starting this problem.
Note that in addition to the required methods below, you are free to add as many other private methods as you want.

(a) The Tweet class stores information about a tweet and its behaviours. A tweet is a message sent using the social network Twitter. For this assignment, tweets are defined as messages of 1 to 15 words (after omitting stop words). The Tweet class stores the following information about a tweet: the user who tweeted it, the date of the tweet, the time of the tweet, and the actual tweet message.

The Tweet class should contain the following private attributes:
• A String userAccount;

This attribute stores the user id.

• A String date;

This attribute stores the date on which the tweet was tweeted.

• A String time;

This attribute stores the time at which the tweet was tweeted.

• A String message;

This attribute stores the tweeted message.

The Tweet class should also contain a private static attribute:
• A HashSet stopWords.

This attribute stores the set of stop words. A stop word is a commonly used word that is ignored by search engines.

Regarding the behaviour of the class, Tweet must implement the following methods:
• A constructor that takes as inputs four Strings. The Strings correspond to:

i) the user account; ii) the date on which the tweet was posted, in the format YYYY-MM-DD; iii) the time at which the tweet was posted, in the format HH:MM:SS; and iv) the message of the tweet. Note that the arguments must be given in this order. The constructor does not have to check the format of the inputs.

• A public non-static checkMessage method which takes no input. This method checks whether the message of this Tweet is valid. The method returns true if the message contains more than 0 and fewer than 16 words (excluding the stop words), and false otherwise. If the HashSet stopWords is equal to null, checkMessage must throw a NullPointerException indicating that the HashSet has not been initialized.
Here are a couple of hints.

• getDate

This returns the date instance variable.

• getTime

This returns the time instance variable.

• getMessage

This returns the message instance variable.

• getUserAccount

This returns the userAccount instance variable.

• toString

This method returns a new String that is the concatenation of the userAccount, a tab character, the date, a tab character, the time, a tab character, and finally the tweeted message.

• isBefore

This method takes as a parameter an instance of Tweet and it returns true when this tweet was posted at an earlier time than the input parameter. Otherwise the method returns false. Remember that the formats of the date and time attributes are YYYY-MM-DD and HH:MM:SS, respectively.

• loadStopWords

The static method loadStopWords takes as input the name of a file that contains a set of stop words. The file has a single column, with one stop word on each line. This method should read the file and initialize the static attribute stopWords. This method has no return value.
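
The validity check and the date/time comparison described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the expected solution: the class TweetChecks and its static methods are hypothetical names (the assignment asks for instance methods on Tweet), and lower-casing each word and stripping surrounding punctuation before the stop-word lookup is an assumed normalisation that the text above does not specify.

```java
import java.util.Set;

// Hypothetical helper class; the names here are illustrative and not part of the assignment.
class TweetChecks {

    // Counts the whitespace-separated words of a message that are not stop words.
    // Assumption: each word is lower-cased and stripped of leading/trailing
    // punctuation before it is looked up in the stop-word set.
    static int countNonStopWords(String message, Set<String> stopWords) {
        if (stopWords == null) {
            throw new NullPointerException("The file of stopWords has not been loaded yet");
        }
        int count = 0;
        for (String token : message.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // splitting an empty message yields one empty token
            }
            String word = token.toLowerCase().replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (!stopWords.contains(word)) {
                count++;
            }
        }
        return count;
    }

    // A message is valid when it has more than 0 and fewer than 16 non-stop words.
    static boolean isValidMessage(String message, Set<String> stopWords) {
        int count = countNonStopWords(message, stopWords);
        return count > 0 && count < 16;
    }

    // Returns true when (date1, time1) is strictly earlier than (date2, time2).
    // Because YYYY-MM-DD and HH:MM:SS are fixed-width and zero-padded,
    // comparing the Strings lexicographically gives chronological order.
    static boolean isBefore(String date1, String time1, String date2, String time2) {
        int byDate = date1.compareTo(date2);
        if (byDate != 0) {
            return byDate < 0;
        }
        return time1.compareTo(time2) < 0;
    }
}
```

In an actual Tweet class, the same logic would read this tweet's message, date, and time attributes and the static stopWords set instead of taking them as parameters.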

(b) The Twitter class stores information and behaviours generated by the Twitter social network.

The Twitter class should contain the following private variable:
• ArrayList tweets

This attribute stores a collection of valid tweets.

Regarding the behaviour of the class, Twitter must implement the following public methods:
• Twitter

This constructor takes no arguments and it initializes tweets as an empty ArrayList.

• loadDB

loadDB takes as its argument the name of a file (i.e., its relative path) that contains a collection of tweets. In particular, we provide the file tweets.txt, which has four columns separated by tabs. The first column contains the userAccount; the second and third contain the date and time at which the post was made, respectively; and the fourth contains the posted message itself. loadDB must read the file line by line (please see the warm-up questions) and construct the corresponding Tweet object for each line. If the tweet is valid, it should be added to the ArrayList of Tweets (Hint: remember that you already coded a method to check whether a tweet is valid). If the tweet is not valid, your program should simply not add it to the ArrayList. Once all the valid tweets are stored in the ArrayList tweets, loadDB must call sortTwitter to sort the list.

• sortTwitter

sortTwitter takes no arguments and it sorts (in increasing order) the Tweet instances based on their date/time of publication. Here, you are free to select the sorting algorithm. This method has no return value.

• getSizeTwitter

This returns the number of tweets in the database.

• getTweet

This method receives an index and returns the tweet stored at that index.

• printDB

This method returns a String that contains all the elements currently stored in the ArrayList tweets. The format of the output must be the same as the one used in the tweets.txt file (i.e., the first column contains the userAccount, the second and third contain the date and time at which the post was made, respectively, and the fourth is the posted message). (Hint: use the toString method from the Tweet class.)

• rangeTweets

rangeTweets takes as parameters two instances of Tweet (i.e., tweet1 and tweet2) and returns a new ArrayList of Tweets containing those tweets posted between the date/time of tweet1 and tweet2 (inclusive). You can assume that tweet1 and tweet2 are elements of the ArrayList tweets. Note that there is no guarantee that tweet1 was posted before tweet2; the method should determine which of the two tweets is the earlier one. You can assume that the ArrayList tweets is sorted, given that sortTwitter() was already called by loadDB.

• saveDB

saveDB takes one parameter, the name of a file. saveDB must write the list of Tweets to the file, following the same format as the tweets.txt file. (Hint: use the printDB method to obtain the String to write.)

• trendingTopic

trendingTopic takes no parameters and returns the word (that is not a stop word) that is the most frequent in the tweets from the database. The most common word is chosen by counting in how many tweet messages the word appears (i.e., if a word appears more than once in a tweet message, it is counted only once).
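
The per-tweet counting rule for trendingTopic can be sketched as below. This is a rough illustration under assumptions rather than the required implementation: the class TrendingSketch is a hypothetical name, the method takes the messages and the stop-word set as parameters instead of reading the tweets and stopWords attributes, and lower-casing plus stripping surrounding punctuation is an assumed normalisation. Ties between equally frequent words are broken arbitrarily here, since the text above does not say how to resolve them.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class; the names here are illustrative and not part of the assignment.
class TrendingSketch {

    // Returns the non-stop word that appears in the largest number of messages,
    // or null when no such word exists.
    static String trendingTopic(List<String> messages, Set<String> stopWords) {
        // Maps each word to the number of messages that contain it at least once.
        Map<String, Integer> messagesContaining = new HashMap<>();

        for (String message : messages) {
            // Words already counted for this message, so repeats count only once.
            Set<String> seenInThisMessage = new HashSet<>();
            for (String token : message.trim().split("\\s+")) {
                // Assumed normalisation: lower-case and strip surrounding punctuation.
                String word = token.toLowerCase().replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
                if (word.isEmpty() || stopWords.contains(word)) {
                    continue;
                }
                if (seenInThisMessage.add(word)) {
                    messagesContaining.merge(word, 1, Integer::sum);
                }
            }
        }

        // Linear scan for the word with the highest message count.
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : messagesContaining.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }
}
```

The inner HashSet is what enforces the "counted only once per tweet" rule: Set.add returns false for a word already seen in the current message, so the counter is incremented at most once per message.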

(c) Main Method

Your main method will not be graded. However, we strongly advise that you write a main method in your class Twitter to test your code. You may include it in your submission as long as it compiles. In the following lines, we show different versions of our main method. That way, you will have some test cases with which to test your code.

• Example 1.

public static void main(String[] args){

The result of running Example 1 is:

Error checking the stopWords database: The file of stopWords has not been loaded yet

Comments: In this example, the function loadDB caught a NullPointerException thrown by checkMessage.

• Example 2.

public static void main(String[] args){

The result of running Example 2 is:

The number of tweets is: 58

Comments: The file tweets.txt initially contains 61 tweets. However, 3 tweets were discarded because their length (after removing the stop words) was not greater than zero and less than sixteen. In particular, the message “I can be MADE into a need.” has length zero after removing the stop words (i.e., all the words in the message are stop words). On the other hand, the messages “USER 55cc3d0f i’m kidding. i’ll die! lol jk i guess i’d be alright but not for a couple days or hours cause i’m always a happy spirit lol” and “RT USER 3a117437: The woman at the rental car spot tried 2 give us a Toyota! No ma’am lk the old spiritual says “aint got time 2 die!”” were discarded because they have 18 and 16 words (after removing the stop words [which are shown in bold]), respectively.

• Example 3.

public static void main(String[] args){

The result of running Example 3 is:

USER_a75657c2 2010-03-03 00:02:54 @USER_13e8a102 They reached a compromise ………….
. . . ………….
USER_a75657c2 2010-03-07 21:45:48 So SunChips made a bag that is 100%
biodegradeable. It is about damn time somebody did.

Comments: Here you should see the 58 elements sorted by date/time. We are just showing the first tweet (posted on 2010-03-03 00:02:54) and the last tweet (posted on 2010-03-07 21:45:48).

• Example 4.

public static void main(String[] args){

The result of running Example 4 is:

[USER_1e22f6a5 2010-03-03 00:20:45 RT @USER_561fe280: Nourish your …..,
USER_36607a99 2010-03-03 02:06:01 RT @USER_561fe280: Nourish your …..,
USER_c271e4ac 2010-03-03 02:07:37 I want a King Kong roll from Sushi …..]

Comments: Here you should obtain a sorted list of three tweets.

• Example 5.

public static void main(String[] args){

The result of running Example 5 is:


Comments: The word “spirit” is the most common word and it appears in 26 tweets.
