Mar 27, 2013, 4:32 PM
Need to partially automate user-driven scraping. Tight Spec, no arguments about completion.
I need some Perl code to semi-automate the extraction of US teachers' school (not home) email addresses. The program (Perl, or another language?) does not have to be clever, and since the spec is fairly complete there shouldn't be much back-and-forth about whether the job is completed.
I HAVE A CSV FILE WITH THESE FIELDS:
teacher first name
teacher last name
school street address
The last two fields of each record (not listed above) start out blank and are filled in with the following values:
(1) "search status" (blank: nothing tried; "found"; "last pattern tried": 1..4; abandoned; email obscured; timed out; others?)
(2) "email address", the email address found, or blank if not (yet) found.
These statuses are click-driven and filled in by the program.
THE SCREEN HAS TWO WINDOWS
The first window is the program window, which I think is, for now, a static UI but with a one-line message / prompt bar at the top. There are some buttons with lit/dark states, some radio buttons, and a few plain old lit/dark readouts. At the bottom of the window are two message areas of one line each. No scrolling for now.
The second window is a browser showing Google search.
The user indicates the browser window to the Perl program in some way during initialization.
We load the table and scan to the first record that has a blank "search status"; or we are at EOF and exit with message and flush and close the file with a message in the
The current record is displayed in the program window.
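To make the "load and scan" step concrete, here is a rough sketch. It is in Python rather than Perl, since the post allows "Perl (or other?)". The column names and the two sample rows are made up for illustration; the real file's header row is not given above.

```python
import csv
import io

# Hypothetical column names; the real header row is not specified in the post.
STATUS = "search status"
EMAIL = "email address"

def load_records(f):
    """Read the whole table into a list of dicts, one per teacher."""
    return list(csv.DictReader(f))

def first_unsearched(records, start=0):
    """Return the index of the first record at or after `start` whose
    search status is blank, or None if none remain (the EOF case)."""
    for i in range(start, len(records)):
        if not (records[i].get(STATUS) or "").strip():
            return i
    return None

# Made-up sample data standing in for the real CSV file.
sample = io.StringIO(
    "first name,last name,search status,email address\n"
    "Lisa,Simpson,found,lsimpson@example.org\n"
    "Bart,Simpson,,\n"
)
records = load_records(sample)
current = first_unsearched(records)  # index of the record to display
```

The same function could serve the "next record" button by passing `start=current + 1`; a return value of `None` would be the cue to flush, close the file, and exit.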
The user clicks on a search method (1..4 radio buttons, but just 1 is available now) and the radio button lights up for clarity. (It goes dark again whenever the current record changes.)
The program does a search on school name plus state (e.g., "Akron High School OH"), and from the search results the operator copies the school's URL and clicks a button in the program window. (Or, if possible, the user just highlights the URL and the program copies it, expanding the selection in both directions and then trimming off any trailing /'s, etc.)
The program pulls the school's URL from the clipboard (or from the program's own buffer if the screen scrape above is possible) and composes, e.g., the following Google search string:
"site:ahs.k12.oh.us Lisa Simpson" without the quotes, and does the search.
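As a sketch of how the two search strings might be composed (again in Python, since the post allows "Perl (or other?)"): the function names are hypothetical, and dropping a leading "www." is my assumption, made so that the result matches the example above. The clipboard and browser handling are left out entirely.

```python
from urllib.parse import urlsplit

def school_query(school_name, state):
    """First search: school name plus state, e.g. 'Akron High School OH'."""
    return f"{school_name} {state}"

def site_host(copied_url):
    """Reduce a copied URL to a bare host name.

    Trailing slashes and paths are dropped by the URL parser. Stripping
    a leading 'www.' is an assumption, not something the spec requires.
    """
    text = copied_url.strip()
    if "://" not in text:
        text = "http://" + text  # urlsplit needs a scheme to find the host
    host = urlsplit(text).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return host

def teacher_query(copied_url, first, last):
    """Second search: restrict Google to the school's site."""
    return f"site:{site_host(copied_url)} {first} {last}"
```

For instance, `teacher_query("http://www.ahs.k12.oh.us/", "Lisa", "Simpson")` would give the string shown above (the full URL here is a hypothetical input).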
The user then browses around and, if the email address is found, copies it into the clipboard (or again, maybe the program can screen-scrape and clean it) and clicks the "found" button, or fails for some reason and clicks a different "search status" button.
Whatever the click, it prompts the program to install the values for "search status" and "email address" into those fields, and then display the revised record.
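A minimal sketch of that update step, in Python (the post allows "Perl (or other?)"). The field names, the function names, and the deliberately loose email pattern are all assumptions for illustration; the pattern is only meant to pick an address out of clipboard text such as a "mailto:" link, not to validate it.

```python
import re

# Hypothetical field names, matching the two fields described in the spec.
STATUS = "search status"
EMAIL = "email address"

# Loose pattern for picking an address out of copied text (an assumption,
# not a full validator).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def clean_email(clipboard_text):
    """Return the first email-looking token in the text, or '' if none."""
    match = EMAIL_RE.search(clipboard_text)
    return match.group(0) if match else ""

def apply_click(record, status, clipboard_text=""):
    """Install the clicked status, and the email only when status is 'found'."""
    record[STATUS] = status
    record[EMAIL] = clean_email(clipboard_text) if status == "found" else ""
    return record
```

Clearing the email on any status other than "found" is a design guess; it keeps a stale clipboard from leaking an address into an "abandoned" record.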
The operator approves and clicks "next record" which goes to the next record with blank "search status".
There will be some user recovery buttons, and the first to be implemented will be "redo record", typically because the operator saw something wrong, or had pushed a wrong "search status" button at the end of the search. So this is included in the first pass on the program.
WHAT SAY YOU?:
OK, lots of description but not all that much action, I believe. Anybody want to talk about some coding?
(This post was edited by evan1138 on Mar 27, 2013, 5:19 PM)