May 9, 2013, 6:14 PM
Post #1 of 5
I have a homework assignment that doesn't make any sense to me. In the code block is what I'm given. I have never done scripting w/ webpages before, not really sure how to get started. The part I could use some pointers on is Part 2 in that I don't know how to tell if a "file" is reachable or not.
How to tell if a file is reachable
I have tried putting "wget http://www.oracle.com/us/solutions/index.html" into my shell and a bunch of info about the site comes up, but I'm not sure what to do with it.
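For what it's worth, since the assignment works on a local snapshot of the site, "reachable" most likely means reachable through the link graph on disk, not over the network, so wget isn't strictly needed. A minimal sketch (all names here are made up for illustration) of checking whether a linked file exists in a snapshot directory:

```python
import os
import tempfile

# Create a tiny throwaway snapshot just to demonstrate the check.
snapshot_dir = tempfile.mkdtemp()
open(os.path.join(snapshot_dir, "index.html"), "w").close()

def link_exists(base_dir, link):
    """Return True if a site-relative link resolves to a file in the snapshot."""
    return os.path.isfile(os.path.join(base_dir, link))

print(link_exists(snapshot_dir, "index.html"))    # True
print(link_exists(snapshot_dir, "missing.html"))  # False
```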
Here is the writeup:
You have a web site containing static pages, such as www.oracle.com and you wish to verify the site.
Part 1: Static Verifier
Write an application called static_verifier that takes two command-line arguments: a directory and a base URI (i.e., Uniform Resource Identifier, such as http://www.oracle.com/us/solutions/index.html). The application will scan all .html files in the directory and its subdirectories for <a> (anchor) and <img> (image) tags to find linked files.
For each link, determine whether it points to an internal (this site) or external resource. If it is internal, verify whether the file exists in your snapshot. Output should consist of the file name, the missing internal links, the valid internal links, and the external links. Indent each section. Within each section, list the names alphabetically so they will be diff-compatible with baseline data.
Sample Output (from modified index.html):
Missing Internal Links
Valid Internal Links
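If it helps to see the shape of Part 1, here is one possible sketch using only the standard library. The function and class names (LinkCollector, classify) are my own invention, not part of the assignment; it treats a link as external when its host differs from the base URI's host, and checks internal links against the snapshot directory:

```python
import os
from html.parser import HTMLParser
from urllib.parse import urlparse

class LinkCollector(HTMLParser):
    """Collect href/src values from <a> and <img> tags in one HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "img" and "src" in attrs:
            self.links.append(attrs["src"])

def classify(html_text, directory, base_uri):
    """Split links into sorted (missing internal, valid internal, external) lists."""
    parser = LinkCollector()
    parser.feed(html_text)
    base_host = urlparse(base_uri).netloc
    missing, valid, external = set(), set(), set()
    for link in parser.links:
        host = urlparse(link).netloc
        if host and host != base_host:
            external.add(link)
        else:
            # Internal link: resolve its path relative to the snapshot directory.
            rel = urlparse(link).path.lstrip("/")
            if os.path.isfile(os.path.join(directory, rel)):
                valid.add(link)
            else:
                missing.add(link)
    # Sorted output keeps the sections diff-compatible with baseline data.
    return sorted(missing), sorted(valid), sorted(external)
```

You would call classify once per .html file found by walking the directory, then print the three indented sections per file.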
Part 2: File Verifier
Write a program called file_verifier that takes the same arguments as static_verifier.
Verify that each file in the subtree is reachable directly or indirectly from the homepage (index.html). Print the list of unreachable files in sorted order.
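Part 2 is essentially a graph search: files are nodes, links are edges, and you do a breadth-first (or depth-first) traversal from index.html; whatever was never visited is unreachable. A sketch, assuming you already have a link-extraction routine from Part 1 (passed in here as extract_links, a name I made up):

```python
import os
from collections import deque

def unreachable_files(directory, extract_links):
    """BFS over internal links from index.html; return sorted unreachable .html files.

    extract_links(path) must yield site-relative link targets found in the
    file at `path` (e.g. the scanning code written for Part 1).
    """
    # Collect every .html file in the subtree, as snapshot-relative paths.
    all_files = set()
    for root, _, names in os.walk(directory):
        for name in names:
            if name.endswith(".html"):
                all_files.add(os.path.relpath(os.path.join(root, name), directory))

    # Standard BFS starting at the homepage.
    seen = set()
    queue = deque(["index.html"])
    while queue:
        current = queue.popleft()
        if current in seen or current not in all_files:
            continue
        seen.add(current)
        for link in extract_links(os.path.join(directory, current)):
            queue.append(link)

    return sorted(all_files - seen)
```

The sorted return value matches the assignment's requirement to print unreachable files in sorted order.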