- The MarvelPaths Application
- Problem 0: Understanding the Marvel Universe Data
- Problem 1: Building the Graph [60 points with Problem 2]
- Problem 2: Finding Paths [60 points with Problem 1]
- Problem 3: Testing Your Solution [12 points]
- Problem 4: Running Your Solution From the Command-Line [3 points]
- Test script file format
- Paths to Files
- What to Turn In
In this assignment, you will put the graph you designed in Homework 5 to use by modeling the Marvel Comics universe. By trying out your Graph ADT in a client application, you will be able to test the usability of your design as well as the correctness, efficiency, and scalability of your implementation.
This application builds a graph containing thousands of nodes and edges. At this size, you may discover performance issues that weren’t revealed by your unit tests on smaller graphs. With a well-designed implementation, your program will run in a matter of seconds. Bugs or suboptimal data structures can increase the runtime to anywhere from several minutes to 30 minutes or more. If this is the case, you may want or need to go back and revisit your graph implementation from Homework 5. Remember that different graph representations have widely varying time complexities for various operations and this, not a coding bug, may explain the slowness.
This Marvel dataset was also used by researchers at Cornell University, who published a research paper showing that the graph is strikingly similar to “real-life” social networks.
In this application, your graph models a social network among characters in Marvel comic books. Each node in the graph represents a character, and an edge
⟨Char1,Char2⟩ indicates that
Char1 appeared in a comic book that
Char2 also appeared in. There should be a separate edge for every comic book any two characters appear in, labeled with the name of the book. For example, if Zeus and Hercules appeared in five issues of a given series, then Zeus would have five edges to Hercules, and Hercules would have five edges to Zeus.
You will write a class
hw6.MarvelPaths (in file
MarvelPaths.java) that reads the Marvel data from a file (
marvel.tsv), builds a graph, and finds paths between characters in the graph.
As you complete this assignment, you may need to modify the implementation and perhaps the public interface of your Graph ADT. Briefly document any changes you made and why in
hw6/changes.txt (no more than 1-2 sentences per change). If you made no changes, state that explicitly. You don’t need to track and document cosmetic and other minor changes, such as renaming a variable; we are interested in substantial changes to your API or implementation, such as adding a public method or using a different data structure. Describe logical changes rather than precisely how each line of your code was altered. For example, “I switched the data structure for storing all the nodes from a ___ to a ___ because ___” is more helpful than, “I changed line 27 from
nodes = new ___(); to
nodes = new ____();.”
Leave your graph in the
hw5 package where it was originally written, even if you modify it for this assignment. There is no need to copy files nor duplicate code! You can just
import and use it in HW6. If you do modify your
hw5 code, be sure to commit those changes your repository.
Before you get started, look at the file with the Marvel data:
hw6/data/marvel.tsv. A TSV (“tab-separated value”) file consists of human-readable data delineated by tabs, and can be opened in your favorite text editor, spreadsheet, or Eclipse. (If you don’t have a favorite text editor, Notepad++ is a simple, free, easy-to-use option for Windows. Textwrangler is a similar editor for Mac OS X.) Each line in
marvel.tsv is of the form
where character is the name of a character, book is the title of a comic book that the character appeared in, and the two fields are separated by a tab.
The first step in your program is to construct your graph of the Marvel universe from a data file. We provide you with a class
MarvelParser to help.
MarvelParser has one static method,
parseData, which reads data from
marvel.tsv, or any file structured in the same format.
parseData creates a
Set of all characters and a
Map from each book to a
List of the characters in that book. These are not the data structures you want, however; you want a Graph.
You may modify
MarvelParser however you wish to fit your implementation. You might change the method signature (parameters and return value) of
parseData, or you might leave
parseData as is and write code that processes its output. The only constraint is that your code needs to have a filename as a parameter so the parser can be reused with any file.
At this point, it’s a good idea to test the parsing and graph-building operation in isolation. Verify that your program builds the graph correctly before continuing. The assignment formally requires this in Problem 3.
The real meat (or tofu) of
MarvelPaths is the ability to find paths between two characters in the graph. Given the name of two characters,
MarvelPaths searches for and returns a path through the graph connecting them. How the path is subsequently used, or the format in which it is printed out, depends on the requirements of the particular application using
MarvelPaths, such as your test driver (see below).
Your program should return the shortest path found via breadth-first search (BFS). A BFS from node u to node v visits all of u’s neighbors first, then all of u’s neighbors’ neighbors’, then all of u’s neighbors’ neighbors’ neighbors’, and so on until v is found or all nodes with a path from u have been visited. Below is a general BFS pseudocode algorithm to find the shortest path between two nodes in a graph G. For readability, you should use more descriptive variable names in your actual code than are needed in the pseudocode:
start = starting node dest = destination node Q = queue, or "worklist", of nodes to visit: initially empty M = map from nodes to paths: initially empty. // Each key in M is a visited node. // Each value is a path from start to that node. // A path is a list; you decide whether it is a list of nodes, or edges, // or node data, or edge data, or nodes and edges, or something else. Add start to Q Add start-> to M (start mapped to an empty list) while Q is not empty: dequeue next node n if n is dest return the path associated with n in M for each edge e=⟨n,m⟩: if m is not in M, i.e. m has not been visited: let p be the path n maps to in M let p' be the path formed by appending e to p add m->p' to M add m to Q If the loop terminates, then no path exists from start to dest. The implementation should indicate this to the client.
Here are some facts about the algorithm:
- It is a loop invariant that every element of Q is a key in M.
- If the graph were not a multigraph, the for loop could have been equivalently expressed as
for each neighbor m of n:.
- If a path exists from start to dest, then the algorithm returns a shortest path.
Many character pairs will have multiple paths. For grading purposes, your program should return the lexicographically (alphabetically) least path. More precisely, it should pick the lexicographically first character at each next step in the path, and if those characters appear in several comic books together, it should print the lexicographically lowest title of a comic book that they both appear in. The BFS algorithm above can be easily modified to support this ordering: in the for-each loop, visit edges in increasing order of m‘s character name, with edges to the same character visited in increasing order of comic book title. This is not meant to imply that your graph should store data in this order; it is merely a convenience for grading.
Because of this application-specific behavior, you should implement your BFS algorithm in
MarvelPaths rather than directly in your graph, as other hypothetical applications that might need BFS probably would not need this special ordering. Further, other applications using the graph ADT might need to use a different search algorithm, so we don’t want to hard-code a particular search algorithm in the graph ADT.
Using the full Marvel dataset, your program must be able to construct the graph and find a path in less than 30 seconds on attu.
ScriptFileTests, the provided class used to run your specification tests, is set to put a 30000 ms (30 second) timeout for each test. (As a point of reference, the staff solution took around 5 seconds to run on attu).
If your Graph ADT has a particularly thorough or expensive
checkRep function it is possible that it could take too long to load the Marvel dataset when all checks are performed. A good
checkRep function should include some way to disable particularly expensive checks by setting a compile-time or run-time variable. You are free to disable the expensive checkRep tests in the version you submit for grading so it will pass all tests in the alloted time.
Because the Marvel graph contains thousands of nodes and hundreds of thousands of edges, using it for correctness testing is probably a bad idea. By contrast, using it for scalability testing is a great idea, but that should come after correctness testing using much smaller graphs. In addition, it is important to be able to test your parsing/graph-building and BFS operations in isolation, separately from each other. For these reasons, you will use a test driver to verify the correctness of both your parser and BFS algorithm on much smaller graphs.
You will write specification tests using test scripts similar to those you wrote in Homework 5. The format is defined in the test file format section below. In addition to writing
*.expected files as before, you will write
*.tsv files in the format defined for
marvel.tsv to test your parser. All these files will go in the
You should create a class named
HW6TestDriver that runs tests in the same way that
HW5TestDriver did. We have provided a starter file in
test/ with method stubs and a small amount of starter code, but you can replace it. You have two choices for your approach:
- You may copy your
hw5/test/HW5TestDriver.javasource file to
hw6/test/HW6TestDriver.java, change it to package
hw6.test, and revise it to match this assignment’s file format specification. This will cause duplicated code between the two problem sets, but avoids some tricky issues with subclassing. This is probably the easier option.
- Alternatively, you may choose to write a new
HW6TestDriverclass that extends the
HW5TestDriverand reuses some of its code. This approach is more elegant. You will need to edit your
HW5TestDriverto make it more extensible — in particular, by making some of its methods and/or fields protected instead of private. Also, static methods may not have the inheritance properties you want. You may find it easiest to copy the
HW5TestDriverand change the references from HW5 to HW6. However you decide to change
HW5TestDriver, it must continue to work correctly for Homework 5.
Your specification tests will most likely cover the bulk of your testing. You may need to test additional behaviors specific to your implementation, such as handling of edge cases. If so, write these tests in a regular JUnit test class or classes and add the class name(s) to
ImplementationTests.java. If you specify data files to load in your implementation tests, see the File Paths section for information about specifying filenames. Be sure to run your tests on attu with
ant validate to verify that the data files are still found successfully under the configuration in which we will run your tests.
main method to
MarvelPaths that allows a user to interact with your program from the command line (your code should read user input from
System.in). The program should read pairs of character names and then print to the console the path between those two characters if any, or else print a suitable message (your code should write to
System.out). There is no rigid specification here for input or output formats, but especially creative ones may receive a small amount of extra credit.
To run your program, a user would run
ant build java -ea -cp bin hw6.MarvelPaths
cse331 directory; for example, in Unix you would do
cd ~/cse331 before running the above commands.
The format of the test file is similar to the format described in Homework 5. As before, the test driver manages a collection of named graphs and accepts commands for creating and operating on those graphs.
Each input file has one command per line, and each command consists of whitespace-separated words, the first of which is the command name and the remainder of which are arguments. Lines starting with # are considered comment lines and should be echoed to the output when running the test script. Lines that are blank should cause a blank line to be printed to the output.
The behavior of the testing driver on malformed input command files is not defined; you may assume the input files are well-formed.
In addition to all the same commands (and the same output) as in Homework 5, the Homework 6 test driver accepts the following new commands:
LoadGraph graphName file.tsv
Creates a new graph named graphName from file.tsv, where file.tsv is a data file of the format defined for
marvel.tsv and is located in the
src/hw6/data directory of your project. The command’s output is
loaded graph graphName
You may assume file.tsv is well-formed; the behavior for malformed input files is not defined.
Filenames supplied to the LoadGraph command should be simple, meaning they do not include the directory in which they are located. For example,
marvel.tsv is a simple filename;
src/hw6/data/marvel.tsv is not. Then, when your program actually loads the file, it must create and use precisely a relative pathname
src/hw6/data/file.tsv in order to work with the autograder. (If you are running your program from the command line rather than from Eclipse, see the File Paths section for help configuring your program to find the file at this path.)
FindPath graphName node_1 node_n
Find the shortest path from node_1 to node_n in the graph using your breadth-first search algorithm. For this command only, underscores in node names should be converted to spaces before being passed into any methods external to the test driver. For example, “node_1” would become “node 1”. This is to allow the test scripts to work with the full Marvel dataset, where many character names contain whitespace (but none contain underscores). Anywhere a node is printed in the output for this command, the underscores should be replaced by spaces, as shown below.
Paths should be chosen using the lexicographic ordering described in Problem 2. If a path is found, the command prints the path in the format:
path from CHAR 1 to CHAR N: CHAR 1 to CHAR 2 via BOOK 1 CHAR 2 to CHAR 3 via BOOK 2 ... CHAR N-1 to CHAR N via BOOK N-1
where CHAR 1 is the first node listed in the arguments to FindPath, CHAR N is the second node listed in the arguments, and BOOK K is the title of a book that CHAR K-1 and CHAR K appeared in.
Not all characters may have a path between them. If the user gives two valid node arguments CHAR_1 and CHAR_2 that have no path in the specified graph, print:
path from CHAR 1 to CHAR N: no path found
If a character name CHAR was not in the original dataset, print:
unknown character CHAR
where, as before, any underscores in the name are replaced by spaces. If neither character is in the original dataset, print the line twice: first for CHAR 1, then for CHAR N. These should be the only lines your program prints in this case — i.e., do not print the regular “path not found” output too.
What if the user asks for the path from a character in the dataset to itself? Print:
path from C to C:
but nothing else.
(Hint: a well-written solution requires no special handling of this case.)
This applies only to characters in the dataset: a request for a path from a character that is not in the dataset to itself should print the the usual “unknown character C“ output.
Several sample test files demonstrating the new commands are provided in
To match the behavior of Eclipse, the autograder and Ant are configured to run your program from your
cse331 project directory. This means that your program must load files using a relative pathname that begins with
src/hwN/data/foo.txt. Your test driver in particular must look for files specifically within
src/hw6/data, as we copy some additional data files to that directory when running the autograder.
If you usually run your program from the command line rather than through Eclipse, you have probably been running it from the
bin directory. From
bin, your program won’t be able to find files using relative pathnames of the form above (i.e. starting with
src), so you need to run it directly from
cse331. To do so,
cd into the
cse331 directory, then tell Java where to look for your class files by adding
bin to the classpath:
java -ea -cp bin hw6.YourClassName
Alternatively, you can run your tests with Ant by running
ant test from the
src/hw6 directory. Unlike
ant validate, this command runs on any machine and does not require your changes to be committed first.
Lastly, be aware that machine specs affect not only how fast your program runs but also how much memory it is allowed to use (more precisely, the maximum heap size). If you’re working on a souped-up machine at home, it’s particularly important that you test on attu.
An easy way to time your program is to run
ant test on attu, just like you run
ant validate. It will run all your implementation and specification tests and write how long each test took to a file. You can also run
ant test.impl or
ant test.spec to run just the implementation or specification tests, respectively.
If your program takes an excessively long time to construct the graph for the full Marvel dataset, first make sure that it parses the data and builds the graph correctly for a very small dataset. If it does, here are a few questions to consider:
- What data structures are you using in your graph? What is their “big-O” runtime? Are there others that are better suited to the purpose?
- Did you remember to correctly override
hashCodeif you overrode
- What is the “big-O” runtime of your
checkRepmethod? Does performance improve if you comment it out or disable expensive checks?
As always, remember to:
- Use descriptive variables names (especially in the BFS algorithm) and inline comments as appropriate.
- Include an abstraction function, representation invariant, and checkRep in all classes that represent an ADT. If a class does not represent an ADT, place a comment that explicitly says so where the AF and RI would normally go. (For example, classes that contain only static methods and are never constructed usually do not represent an ADT.) Please come to office hours if you feel unsure about what counts as an ADT and what doesn’t.
You should add, commit, and push the following files to your CSE 331 GitLab repository:
hw6/*.java[Other classes you create, if any (there may be none!)]
hw6/test/*Test.java[Other JUnit test classes you create, if any]
Additionally, be sure to commit any updates you may have made to the following files, so we have the correct version for this HW assignment:
hw5/*.java[Your graph ADT]
When you have committed and pushed all of your changes and are done with the assignment, you should create a git tag in your repository named
hw6-final and push that tag to your repository. Once you have committed and pushed that tag, your assignment has been submitted and the staff will grade the files in your repository that are labeled with that tag.
ant validate to verify that you have submitted all files and that your code compiles and runs correctly when compiled with
javac on attu.
For problems or questions regarding this page, contact: firstname.lastname@example.org.