Google Scholar Scraper and Excel Parser

 



ochez
Novice

Jul 8, 2009, 5:57 AM

Post #1 of 7 (4317 views)
Google Scholar Scraper and Excel Parser

My boss wants me to write a program that takes a file with his list of published works (could be Excel or Word doc...I figured Excel would be easier), searches them on Google Scholar, then returns the "Cited by:" number for each paper.

I found a nice script at http://davide.eynard.it/cgi-bin/perlcode.pl?file=scholar.pl that scrapes Google Scholar, but for the life of me I can't combine my goals.

Any immediate help would be awesome!


JenniC
Novice

Jul 8, 2009, 11:49 AM

Post #2 of 7 (4307 views)
Re: [ochez] Google Scholar Scraper and Excel Parser

    

This is fairly simple using biterscripting. Let's say your search term is "Prof. XYZ". The Google Scholar URL will be "http://scholar.google.com/scholar?q=Prof.%20XYZ".







Code
   

# Script scholar.txt
var str page ; cat "http://scholar.google.com/scholar?q=Prof.%20XYZ" > $page
# Keep collecting and printing the string between "cited by " and "<".
while ( { sen -c -r "^cited by &<^" $page } > 0 )
do
var str match ; stex -c -r "^cited by &<^" $page > $match
stex -c -r "^cited by ^]" $match > null ; stex -c -r "[^<^" $match > null
# $match now has only the number following "cited by ". Print it.
echo $match
done



I tested this script and it works. To try it, download biterscripting ( http://www.biterscripting.com ), save the script as C:\Scripts\scholar.txt, and call it as follows.


Code
  script scholar.txt



You can also call it from a perl program. Or, you can translate the functionality to perl. If you make the script better, please post it. I think a lot of people can benefit from your better script.
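For anyone who would rather translate the functionality than install biterscripting, the extraction step might look roughly like this in Python. This is only a sketch under assumptions: the name `cited_by_counts` and the sample HTML are made up for illustration, it assumes the page text contains "Cited by N" for each result, and fetching the page is left out entirely.

```python
import re

# Assumption: each result carries the text "Cited by <number>".
# The match is case-insensitive so "cited by" and "Cited by" both count.
CITED_BY = re.compile(r"cited by\s+(\d+)", re.IGNORECASE)

def cited_by_counts(page_html):
    """Return every number that follows "cited by" in the page, in order."""
    return [int(number) for number in CITED_BY.findall(page_html)]

# Hypothetical snippet standing in for a downloaded results page.
sample = (
    '<a href="/scholar?cites=1">Cited by 12</a> ... '
    '<a href="/scholar?cites=2">Cited by 3</a>'
)
print(cited_by_counts(sample))
```

A page with no "Cited by" text simply gives an empty list.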

I use biterscripting to parse/scrape our own web pages.

Jenni


(This post was edited by JenniC on Jul 8, 2009, 11:50 AM)


ochez
Novice

Jul 9, 2009, 8:32 AM

Post #3 of 7 (4284 views)
Re: [JenniC] Google Scholar Scraper and Excel Parser

Thank you for the code, I'm about to try it out...however, the main problem I'm having is combining results like these with a script that opens up an Excel file with a list of papers to search for, and then scrapes Google Scholar and returns the number of citations to a column in the Excel sheet.

I've written code that does both of these separately, but I can't seem to combine them. I'd post the code if needed, but I'm not sure it's worth it.


JenniC
Novice

Jul 10, 2009, 9:05 AM

Post #4 of 7 (4266 views)
Re: [ochez] Google Scholar Scraper and Excel Parser

ochez

Today is your lucky day. I was able to come up with just the right script for you.

If you have a list of search titles in an Excel file, let's change our scholar.txt script as follows.

1. We will take the search title as an argument instead of hard-coding.

2. We will wrap the search title within double quotes so Google Scholar will find that exact title.

3. We will use the first (and only) "cited by" number.

Here is the resulting scholar.txt script.


Code
   
# Script scholar.txt
# Input argument - search title
var str title
var str page ; cat ("http://scholar.google.com/scholar?q="+"\""+$title+"\"") > $page

# Get string between "cited by " and "<".
var str match ; set $match = "0"
if ( { sen -c -r "^cited by &<^" $page } > 0 )
do
    stex -c -r "^cited by &<^" $page > $match
    stex -c -r "^cited by ^]" $match > null ; stex -c -r "[^<^" $match > null
done
endif

# $match now has the number. Print it.
# If no matches were found, $match is "0".
echo $match





Save this script as C:\Scripts\scholar.txt.
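The three changes listed above (title as an argument, title wrapped in double quotes, first "cited by" number only) could be sketched in Python as follows. This is a rough sketch, not a translation I have run against Google Scholar: `scholar_url` and `first_cited_by` are hypothetical names, the URL prefix is the one used in the script above, and the page download itself is not shown.

```python
import re
from urllib.parse import quote

def scholar_url(title):
    """Build the search URL for an exact-title query."""
    # Wrap the title in double quotes, then percent-encode it
    # (a space becomes %20, a double quote becomes %22).
    return "http://scholar.google.com/scholar?q=" + quote('"' + title + '"')

def first_cited_by(page_html):
    """Return the first "cited by" number on the page, or 0 if there is none."""
    match = re.search(r"cited by\s+(\d+)", page_html, re.IGNORECASE)
    return int(match.group(1)) if match else 0

print(scholar_url("Prof. XYZ"))
print(first_cited_by('<a href="#">Cited by 12</a> <a href="#">Cited by 3</a>'))
```

Returning 0 when nothing matches mirrors the default of "0" in the script above.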

Let's now write another script that will read entries from the Excel file, call the scholar.txt script for each one, and write the results back to the file, one by one. We will assume the Excel data has been saved as a tab-separated text file at C:\X.txt, with the search title in the first column.




Code
   
# Script excel.txt

# Read excel file.
var str input, output ; cat "C:\X.txt" > $input

# Process entries one by one.
while ($input <> "")
do
    var str entry ; lex "1" $input > $entry
    var str count ; script scholar.txt title($entry) > $count
    set $output = $output+$entry+"\t"+$count+"\n"
done

# The updated output is in $output. Write it back.
echo $output > "C:\X.txt"





Save this script as C:\Scripts\excel.txt. Start biterscripting and enter the following command.


Code
  script excel.txt



When the script has completed running, the results will be in X.txt. Open X.txt with Excel, or print it.
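The same read, look up, append loop can be sketched in Python with the standard csv module. Treat this as an illustration under assumptions: `add_counts` is a hypothetical name, the lookup is passed in as a function so the example runs without touching the network, and the sample titles and counts are invented.

```python
import csv
import io

def add_counts(rows_in, rows_out, lookup):
    """Copy tab-separated rows, appending lookup(title) as a new last column."""
    reader = csv.reader(rows_in, delimiter="\t")
    writer = csv.writer(rows_out, delimiter="\t", lineterminator="\n")
    for row in reader:
        if not row:
            continue  # skip blank lines
        title = row[0]  # search title is assumed to be in the first column
        writer.writerow(row + [lookup(title)])

# Invented data standing in for the tab-separated file and for Google Scholar.
fake_counts = {"Paper A": 12, "Paper B": 0}
src = io.StringIO("Paper A\nPaper B\n")
dst = io.StringIO()
add_counts(src, dst, lambda title: fake_counts.get(title, 0))
print(dst.getvalue())
```

With real files, `src` and `dst` would be two files opened with `newline=""`, and `lookup` would be whatever function fetches the count for one title.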

I have not been able to test these scripts this time, so test them first.

If you improve upon these scripts, do post them back. There is value in them. There is always value in automating something that one would otherwise need to do manually (typing, clicking, reading, entering) item by item.

Jenni


(This post was edited by JenniC on Jul 10, 2009, 10:33 AM)


ochez
Novice

Jul 10, 2009, 11:08 AM

Post #5 of 7 (4256 views)
Re: [JenniC] Google Scholar Scraper and Excel Parser

Thanks so much, it works great! I just had to add an "endif" in scholar.txt and it was good to go!


ochez
Novice

Jul 10, 2009, 11:34 AM

Post #6 of 7 (4254 views)
Re: [ochez] Google Scholar Scraper and Excel Parser

It also helps to write the output to a separate file so that the original input file isn't changed.
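One way to pick that separate file name, sketched in Python: derive it from the input path instead of hard-coding it. The function name `output_path_for` and the "_counts" suffix are just my own choices for illustration.

```python
from pathlib import PureWindowsPath

def output_path_for(input_path, suffix="_counts"):
    """Return a sibling path such as X_counts.txt for an input of X.txt."""
    path = PureWindowsPath(input_path)
    # Keep the folder and extension, change only the file name.
    return path.with_name(path.stem + suffix + path.suffix)

print(output_path_for(r"C:\X.txt"))
```

The results then go to the new path and the original list stays untouched, so a failed run can simply be repeated.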


JenniC
Novice

Jul 10, 2009, 11:34 AM

Post #7 of 7 (4254 views)
Re: [ochez] Google Scholar Scraper and Excel Parser

ochez

You are welcome. Glad to be of help.

(I noticed the missing endif a little later, and edited my post to add it. So, the scripts are now correct.)

If anyone needs this in the future, you will need to download biterscripting ( http://www.biterscripting.com ) to run the scripts posted here.

Jenni


(This post was edited by JenniC on Jul 10, 2009, 11:39 AM)

 
 

