As I’ve said before, VuGen makes a great content-scraping tool when you want a quick-and-dirty script to save specific data from multiple web pages.
In this example, I wanted to create a list of all the WordPress plugins available from http://wordpress.org/extend/plugins/ (currently there are 4,245) and save the metadata for each plugin:
- Number of downloads
- Version number
- Rating
- etc…
vuser_init()
{
// Only download content from "wordpress.org"
web_add_auto_filter("Action=Include", "Host=wordpress.org", LAST);
return 0;
}
/*
Load the next "browse by popularity" page, then load each plugin page in turn.
Save version and statistics data for each plugin to a file.
There are 332 pages, so set the number of iterations to 332 in Runtime Settings.
http://wordpress.org/extend/plugins/browse/popular/page/2 - http://wordpress.org/extend/plugins/browse/popular/page/332
*/
Action()
{
int i;
char* file = "C:\\TEMP\\output.txt"; // backslashes must be escaped in a C string literal
lr_start_transaction("load page");
web_reg_save_param("PluginName",
"LB=\">",
"RB=",
"Ord=All",
"Search=Body",
LAST);
web_reg_save_param("PluginURL",
"LB=",
"Ord=All",
"Search=Body",
LAST);
web_reg_save_param("NumberOfDownloads",
"LB=Downloads ",
"RB= ",
"Ord=All",
"Search=Body",
LAST);
web_reg_find("Text=WordPress › Most Popular « WordPress Plugins", LAST);
web_reg_find("Text={IterationNum}", LAST);
web_url("Popular Plugins",
"URL=http://wordpress.org/extend/plugins/browse/popular/page/{IterationNum}",
"TargetFrame=",
"Resource=0",
"RecContentType=text/html",
"Referer=",
"Snapshot=t1.inf",
"Mode=HTML",
LAST);
lr_end_transaction("load page",LR_AUTO);
// loop through all plugin pages that are linked from this page.
for (i=1; i<=lr_paramarr_len("PluginName"); i++) {
lr_start_transaction("load plugin page");
web_reg_save_param("Version",
"LB=Version: ",
"RB=",
"Ord=1",
"Search=Body",
LAST);
web_reg_save_param("LastUpdated",
"LB=Last Updated: ",
"RB=\n",
"Ord=1",
"Search=Body",
LAST);
web_reg_save_param("RequiresWordPressVersion",
"LB=Requires WordPress Version: ",
"RB=",
"Ord=1",
"Search=Body",
"NotFound=Warning", // this is an optional field
LAST);
web_reg_save_param("CompatibleUpTo",
"LB=Compatible up to: ",
"RB=",
"Ord=1",
"Search=Body",
"NotFound=Warning", // this is an optional field
LAST);
web_reg_save_param("AuthorHomepage",
"LB=Author Homepage »",
"Ord=1",
"Search=Body",
"NotFound=Warning", // this is an optional field
LAST);
web_reg_save_param("PluginHomepage",
"LB= Plugin Homepage » ",
"Ord=1",
"Search=Body",
"NotFound=Warning", // this is an optional field
LAST);
web_reg_save_param("Rating",
"LB=",
"Ord=1",
"Search=Body",
LAST);
web_reg_save_param("NumberOfRatings",
"LB=(",
"RB= ratings)",
"Ord=1",
"Search=Body",
LAST);
web_reg_save_param("FileName",
"LB= ",
"Ord=1",
"Search=Body",
LAST);
lr_save_string(lr_paramarr_idx("PluginURL", i), "URL");
lr_save_string(lr_paramarr_idx("PluginName", i), "Name");
lr_save_string(lr_paramarr_idx("NumberOfDownloads", i), "Downloads");
web_reg_find("Text=« WordPress Plugins", LAST);
web_url("Plugin Page",
"URL={URL}",
"TargetFrame=",
"Resource=0",
"RecContentType=text/html",
"Referer=http://wordpress.org/extend/plugins/browse/popular/",
"Snapshot=t2.inf",
"Mode=HTML",
LAST);
jds_append_to_file(file, lr_eval_string("{Name}\t"
"{Version}\t"
"{LastUpdated}\t"
"{URL}\t"
"{PluginHomepage}\t"
"{AuthorHomepage}\t"
"{RequiresWordPressVersion}\t"
"{CompatibleUpTo}\t"
"{Downloads}\t"
"{Rating}\t"
"{NumberOfRatings}\t"
"{FileName}\t\n"));
lr_end_transaction("load plugin page", LR_AUTO);
}
return 0;
}
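The script calls jds_append_to_file(), which is not shown in this post. As a rough idea of what such a helper might look like, here is a minimal sketch in plain standard C. The name append_to_file and its return values are my own assumptions for illustration, not the original helper. Also note that VuGen's C interpreter may not declare FILE unless you include the header yourself, so inside a script the stream is often held in a long instead; treat this as a starting point rather than drop-in VuGen code.

```c
#include <stdio.h>

/* Hypothetical stand-in for jds_append_to_file(): the original helper is
   not shown in the post, so this name and behaviour are assumptions.
   Appends text to the file at path, creating the file if it is missing.
   Returns 0 on success, -1 on failure. */
int append_to_file(const char *path, const char *text)
{
    FILE *fp = fopen(path, "a");   /* "a" = append mode, create if missing */
    if (fp == NULL) {
        return -1;
    }
    if (fputs(text, fp) == EOF) {
        fclose(fp);
        return -1;
    }
    /* fclose flushes buffered output, so its result is checked as well */
    return (fclose(fp) == 0) ? 0 : -1;
}
```

Opening and closing the file on every call is slow, but it means each plugin's record is on disk before the next page is requested, so a failed iteration does not lose the rows already collected.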
For those who would like a copy of the raw data, it is available here (904 KB).
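If you want to process the output file yourself, each line is meant to be one plugin, with the fields separated by tabs in the order the script writes them (Name, Version, LastUpdated and so on). Here is a small sketch of how a line could be split back into fields in standard C. The function name split_tsv_line is hypothetical, and it assumes no field contains a tab itself.

```c
#include <stddef.h>
#include <string.h>

/* Splits one tab-separated line into fields, in place (illustrative
   helper, not part of the original script).
   The line ending and each tab separating stored fields are overwritten
   with '\0', and fields[] receives pointers into line. Empty fields are
   kept so that column positions stay aligned.
   Returns the number of fields stored (at most max_fields). */
size_t split_tsv_line(char *line, char *fields[], size_t max_fields)
{
    size_t count = 0;
    char *p = line;

    line[strcspn(line, "\r\n")] = '\0';   /* strip the line ending */

    while (count < max_fields) {
        char *tab;
        fields[count++] = p;
        tab = strchr(p, '\t');
        if (tab == NULL) {
            break;                        /* last field on the line */
        }
        *tab = '\0';                      /* terminate this field */
        p = tab + 1;                      /* next field starts after the tab */
    }
    return count;
}
```

Because the script writes a tab after the last value, a complete record splits into the twelve values plus one empty trailing field.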
3 Comments
The raw data isn’t available anymore. Any chance you could rerun your script and send the output to me? Thanks.
Hi, is this VuGen script not working anymore? Could you please fix it?
Thanks.
Hi Stuart,
It’s an old post, but I need to tell you that you have pointed out a very useful function here, ‘web_add_auto_filter’, which can be used to filter out some requests.
This function works fine in LR 11, but I need to backport my script to LR 6.5.
I have a JSP request which, when called, in turn calls 4 other requests. I need to filter out one of these 4 requests. In LR 11 I can do that using ‘web_add_filter’, but this function does not seem to be compatible with LR 6.5.
Could you suggest some other way to achieve the same result? Your help would be highly appreciated.
Thanks,
Aditi