Apr 10, 2015, 12:07 PM
Post #54 of 102
Re: [stuckinarut] HASH-O-RAMA Data Processing Problem
Finally have a new version; please see attached. I have read back through all of your messages from the last week or so in order to implement what's remaining. Forgive me if I have missed anything; it might be worth going back through your messages, retesting, and reporting anything not to your satisfaction. Please read all of the notes below in detail, as they may offer explanations for some of your queries.
$perl main.pl --interactive --phase_n=0
$perl main.pl --interactive --phase_n=1
$perl main.pl --interactive --phase_n=2
add --base=path/to/dir to temporarily run on a different set of data.
add --case_sensitive to temporarily turn on case sensitivity.
add --wtf_threshold=7 to temporarily adjust the wtf threshold.
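For example, the overrides can be combined on a single run (the path is a placeholder):
$perl main.pl --interactive --phase_n=1 --base=path/to/dir --case_sensitive --wtf_threshold=7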
- code better organized. The core algorithm is now in one place, the _input_contestants function, which makes the code more manageable, although still not perfect. I have used a couple of bad practices purely for simplicity, and the interactive / configuration interface is not great. When the code finally works as desired, we can consider refactoring and improving, though that probably won't be necessary.
- if you have problems installing List::MoreUtils, remove the use statement at the top and add 'any' to the List::Util import list. I don't want to upgrade List::Util on my test machine at the moment to support the any function.
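If it helps, the swap looks roughly like this (import lists are illustrative, not the real ones in main.pl; List::Util has shipped any since version 1.33, core as of Perl 5.20):

```perl
use strict;
use warnings;

# Before (needs List::MoreUtils installed):
#   use List::MoreUtils qw(any);
# After (no extra install, provided List::Util is 1.33 or newer):
use List::Util qw(any);

my @calls = qw(K6NV VE3KI W1AW);

# any returns true if the block is true for at least one element.
my $found = any { $_ eq 'VE3KI' } @calls;
my $msg   = $found ? 'found' : 'not found';
print "$msg\n";
```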
- boolean command line arguments are now controlled via --arg (on) and --noarg (off); omitting the flag leaves the default, e.g. --case_sensitive / --nocase_sensitive.
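This is presumably Getopt::Long's negatable-option syntax; a minimal sketch (the option name matches the flags above, but the default and variable names are assumptions, not the real main.pl code):

```perl
use strict;
use warnings;
use Getopt::Long;

# The '!' suffix makes a boolean option negatable, so Getopt::Long
# accepts both --case_sensitive (sets 1) and --nocase_sensitive (sets 0).
my $case_sensitive = 0;    # assumed default when neither form is supplied
GetOptions('case_sensitive!' => \$case_sensitive)
    or die "bad command line options\n";

print 'case sensitivity: ', ($case_sensitive ? 'on' : 'off'), "\n";
```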
- phase2 added since the introduction of adjustments. Once you have run phase1 you can make adjustments, then run phase2 to rescore; repeat if necessary. phase2 does not modify the errors log.
- revamped prompt system. For non-input prompts, you just type c to continue or e to exit the script. For input files that don't exist or are empty, you can enter lines if you wish: type each line followed by return, then type c when you have finished. This is especially useful for testing different adjustments.
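To illustrate the input flow, here is a simulated session (the typed lines are made up, and in main.pl they would come from the prompt on STDIN, not an array):

```perl
use strict;
use warnings;

# Simulated user input; the last 'c' finishes entry and continues.
my @typed = ('KA1ABC 59 001', 'KB2DEF 59 002', 'c');

my @entered;
for my $line (@typed) {
    last if $line eq 'c';                     # 'c' = continue with what was entered
    die "script exited\n" if $line eq 'e';    # 'e' = exit the script
    push @entered, $line;                     # anything else is stored as an input line
}
printf "%d line(s) entered\n", scalar @entered;
```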
- no-return accuracy has been massively improved, but it is still not perfect. I have left it in for now; you'll notice the error log is no longer flooded with them. If desired, I'll make each error type configurable on / off.
- selfie error added. I hadn't noticed it by eye, but I noticed in the errors log there was one selfie error in your 2014 data!
- output the total row count at the end of each output file in the format "# total = n". I would prefer to keep the # at the beginning, as it marks a comment line that is ignored when the file is read back in, which is especially important for the weights and errors logs.
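This is why keeping the leading # matters: the read-back side can simply skip every comment line, footer included. A rough sketch (the field layout here is invented for illustration, not the real log format):

```perl
use strict;
use warnings;

my @raw = (
    '# weights log',      # header comment, ignored on read-back
    'K6NV|3',
    'VE3KI|5',
    '# total = 2',        # footer comment, also ignored on read-back
);

# Keep only lines that do not start with '#'.
my @rows = grep { !/^\s*#/ } @raw;
printf "read %d data rows\n", scalar @rows;
```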
- I haven't adjusted the weighting algorithm; I will if you are still not satisfied, as it is easy to adjust. Remember, a single wtf shouldn't have much meaning alone; it only has meaning in comparison against other wtfs. The current weighting algorithm is prepared for potential issues that we have not yet seen in our test data. But I do agree with you that perhaps we should only weight call cnqs, not log cnqs.
- unsubmitted log created during phase1, it contains log signs that were deciphered to come from unsubmitted logs.
- every error is now listed under the errors column in the errors log. As a result, for now I have removed the wtf < wtf_threshold entry from the wtf column for cnq errors.
- looking over your comparisons pdf, there are indeed huge differences between manual and automatic. I believe the improvements in this version will reduce the gap, especially for no returns. It would be good if you could regenerate this comparison. We should then consider selecting sample records and investigating the differences.
- you described issues with log sign K6NV against call sign VE3KI, but I failed to confirm this in my own investigation when testing this version. Also, you raised this issue when you had muddled up the old and new data. Please revisit this.
- I didn't go over your case studies pdf in detail, these should be revisited.
- you discussed weighting percentages, but I didn't quite understand what you meant. Did you mean that you would like to group wtfs by their sign in the weights log and calculate their individual percentages?
- I need to re-read your notes regarding piped locations in the scores log. As far as I can tell, though, it wouldn't really be possible to figure out that when location x is supplied, they actually meant location y; you will probably have to control this via the adjustments log instead. Let me know if I misunderstood.
- Always keep in mind that you have the power to accurately control the outcome by adjusting the weights and adjustments logs. Adjusting a weight controls a range of outcomes, while inserting an adjustment controls a specific outcome.