[hcs-d] Wget from PIN-authenticated page

Peter Bailis pbailis at fas.harvard.edu
Tue Dec 7 11:07:02 EST 2010


Hey HCS Hackers,

I'm helping a thesis-writing friend with some automated lookups on a
pharmacology database.  The database doesn't have an API, but we've gotten
the owners' permission to scrape it, so I'm planning to use wget (I know
there are libraries, but I just want a quick and dirty script).  The site
is paywalled behind Harvard PIN authentication, though, and since my
initial attempts haven't been very successful, I thought I'd ask whether
anyone has experience getting through authentication with wget, knows what
I'm doing wrong, or has any other ideas.

My problem is that I'm not sure how to perform the initial authentication on
the PIN login page using wget so that I can store the cookie for later
requests.  I've tried opening the page in Firefox, grabbing the cookie, and
converting Firefox's sqlite cookie store to text (sqlite3 -separator ' '
cookies.sqlite 'select * from moz_cookies' > cookies.txt), but I still get
the PIN page.  Any thoughts?
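
One thing that may matter here: wget's --load-cookies option expects the
Netscape cookies.txt layout (seven tab-separated fields per line), which a
plain 'select *' dump with a space separator does not produce.  Below is a
minimal Python sketch of that conversion, not a tested fix for the PIN page.
It assumes the moz_cookies table has the columns host, path, isSecure,
expiry, name and value, and that expiry is stored in seconds; the function
name is made up for illustration.  Firefox may also keep session cookies
out of cookies.sqlite entirely, so a login cookie might not be in the file
at all.

```python
import sqlite3


def firefox_cookies_to_netscape(sqlite_path, out_path):
    """Dump a Firefox cookies.sqlite file as a Netscape-style cookies.txt.

    Assumes moz_cookies has the columns host, path, isSecure, expiry,
    name and value, and that expiry is a Unix timestamp in seconds.
    Returns the number of cookies written.
    """
    conn = sqlite3.connect(sqlite_path)
    try:
        rows = conn.execute(
            "SELECT host, path, isSecure, expiry, name, value FROM moz_cookies"
        ).fetchall()
    finally:
        conn.close()

    with open(out_path, "w") as out:
        out.write("# Netscape HTTP Cookie File\n")
        for host, path, is_secure, expiry, name, value in rows:
            fields = [
                host,
                # Second field: whether the cookie applies to subdomains.
                "TRUE" if host.startswith(".") else "FALSE",
                path,
                "TRUE" if is_secure else "FALSE",
                str(expiry),
                name,
                value,
            ]
            out.write("\t".join(fields) + "\n")
    return len(rows)
```

The resulting file could then be passed along the lines of
wget --load-cookies cookies.txt <url>.  Running the conversion on a copy of
cookies.sqlite is probably safer, since Firefox can hold a lock on the
original while it is open.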

The site is http://nrs.harvard.edu/urn-3:hul.eresource:clinphar

Thanks,
Peter