PHP Code Needed
Posted in: darrelrussell.com
07 Jan 2009
Technorati.com has a search system which tells the user how many incoming links a site has.
If you enter a URL in the following format in your browser's address bar, Technorati tells you how many incoming links that site has:
The format is: http://technorati.com/search/www.SiteName.com
To see a working example, see: http://technorati.com/search/www.socioeconomics.com
WHAT I NEED:
I have about 400 sites, and I want to get the latest number of incoming links for each of them.
To achieve this goal, I need someone to write a PHP script which does that automatically in order to save me from the trouble of manually checking them every once in a while.
A script which can take good care of the following would do just fine:
1- Will extract the necessary information from Technorati either simultaneously OR when I run it. (which means that the script does not necessarily have to perform the same task each time the page loads. This would take a lot of time. I can run and see the results myself, and update it every week manually.)
2- Will not require a MySQL to work, and will be self-sufficient.
3- Will order the results top-down.
If it cannot fetch one URL, it stops querying the rest and leaves the user with an empty output file, regardless of when the problem occurred.
Here is the error message:
cannot fetch name.blogspot.com at technorati.pl line 26,
I hope it is something simple, because it is almost impossible not to have a fetch problem when I have almost 400 urls in the list.
I was at the 10th step.
I installed Perl under c:\perl\perl
so I typed:
c:\perl\perl\perl technorati.pl site2.txt >output.txt
However, it returned the following error:
"perl is not recognized as an internal or external command, operable program or batch file."
cannot fetch sitename1.com at technorati.pl line 25,
Use of uninitialized value in pattern match
Use of uninitialized value in pattern match
Now, if this is how it should be when the script cannot retrieve the information from the Technorati site, then fine, we have no problems at all.
But the error message implies that it is something else. Am I wrong?
I did a trial for only 4 sites like you recommended - to make it last not so long.
It worked, but returned only one result.
Does that mean that the other 3 had the value 0?
defined($page) or die 'cannot fetch '.$site;
You can either change it to:
# defined($page) or die 'cannot fetch '.$site;
or:
defined($page) or warn 'cannot fetch '.$site;
The first option stops the script from checking whether the page could be fetched at all; the second option prints a warning that the site could not be fetched and then carries on with the next one.
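As a rough sketch of the "warn and carry on" idea, the loop can also collect the sites that failed so they can be re-run later. The name fetch_all and the $fetch code reference are hypothetical, introduced only for this illustration; $fetch stands for anything that returns the page text or undef on failure, which is how LWP::Simple's get behaves.

```perl
use strict;
use warnings;

# Hypothetical helper: calls $fetch for every site, keeps the pages that
# arrived, and records the sites that could not be fetched.
sub fetch_all {
    my ($fetch, @sites) = @_;
    my (%pages, @failed);
    for my $site (@sites) {
        my $page = $fetch->($site);
        if (defined $page) {
            $pages{$site} = $page;
        }
        else {
            # warn writes to STDERR, so the message is not redirected
            # into output.txt by ">output.txt"
            warn "cannot fetch $site\n";
            push @failed, $site;
        }
    }
    return (\%pages, \@failed);
}
```

With LWP::Simple loaded, it could be called as fetch_all(sub { get('http://technorati.com/search/'.$_[0]) }, @sites), and the second return value then lists the sites worth trying again.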
Thank-you for your question.
Whilst I was working on this, vladimir-ga produced an answer that is nearly complete but fails to sort the list once it has been obtained. I will build on his solution to provide you with a complete answer.
First of all, you will require a text file that contains each of your website URLs. Each URL should be on a separate line. The file should be placed in the same location as the Perl script we will write below. As vladimir-ga stated, it should take the form:
www.site1.com
www.site2.com
www.site3.com
...
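A small sketch of reading such a list, assuming the file may contain blank lines or Windows line endings. The name read_sites is hypothetical and not part of the script in this thread; it takes an already opened filehandle.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the non-empty lines of the filehandle,
# one site per line, with surrounding whitespace removed.
sub read_sites {
    my ($fh) = @_;
    my @sites;
    while (my $line = <$fh>) {
        $line =~ s/^\s+|\s+$//g;   # trims spaces and \r\n line endings
        next if $line eq '';       # skip blank lines
        push @sites, $line;
    }
    return @sites;
}
```

It could be used as: open(my $fh, '<', $ARGV[0]) or die "cannot open $ARGV[0]"; my @sites = read_sites($fh);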
The ideal solution would make use of the Technorati API, but that would require more time than a $40 question would permit and would also require you to register with the Technorati site for an API key. You did not mention in your question whether you had one of these, so my solution, like vladimir-ga's, does not use the API method.
The API method would probably be quicker in the long run. Submitting 400 URLs at once will take some time... in my solution I have built in a 2 second delay before each URL is submitted to Technorati. Without this, their server would be hit unfairly hard, which may degrade their service and result in your IP address being banned in the future. As a courtesy, it is always recommended not to "hit" a server too often or too quickly when performing tasks such as these.
I was unsure whether you wanted the number of sites that link to your url or what vladimir-ga gave you which was the latest number of links. It is not too difficult to switch back to vladimir-ga's solution.
My script will output a SORTED list with the most linked site in first position. The output will be in the form "number url". This can easily be changed to another format; just let me know what you require and I can alter this for you.
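To illustrate what "most linked site first" needs in practice, here is a minimal sketch, assuming every line starts with the link count. A plain sort() compares the lines as text, so "9 ..." would land after "10 ..." and "132 ..."; comparing the leading numbers avoids that. The name sort_by_links is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: sorts "count url" lines so the highest count comes
# first, breaking ties alphabetically (ignoring case) on the whole line.
sub sort_by_links {
    my @lines = @_;
    return sort {
        ($b =~ /^(\d+)/)[0] <=> ($a =~ /^(\d+)/)[0]
            or lc($a) cmp lc($b)
    } @lines;
}

print "$_\n" for sort_by_links('9 www.site1.com', '132 www.site2.com', '10 www.site3.com');
# prints:
# 132 www.site2.com
# 10 www.site3.com
# 9 www.site1.com
```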
Finally to run the script on a Windows machine you should use something like this at the command prompt:
perl nameofthescript.pl nameofthetextfilecontainingtheurls.txt
If you have any further questions or queries on this subject please do not hesitate to ask and I will do my best to respond swiftly.
Finally the finished script (with comments and sorting):
#!/usr/bin/perl
# set up required modules and stuff
use strict;
use warnings;
use LWP::Simple;
# if a parameter is not passed to the script then stop
die "parameter missing" if @ARGV != 1;
# read the urls to check
open (URLS, $ARGV[0]) or die "cannot open ".$ARGV[0];
# set up some holding variables
my $site;
my @urls;
# loop through the urls and process them
while ($site = <URLS>) {
    # remove line feeds
    chomp($site);
    # get the search page
    my $page = get 'http://technorati.com/search/'.$site;
    # check something is found
    defined($page) or die 'cannot fetch '.$site;
    # match the number of sites linking to your url
    $page =~ m/(.*) sites link/ or next;
    # add it to a url in the format:
    # number of links url
    push @urls, "$1 $site";
    # wait for 2 seconds before querying Technorati again
    sleep(2);
}
# sort the urls (note: this is a plain string sort, so "9" sorts after "10")
@urls = sort(@urls);
# print out the sorted list
foreach my $link ( @urls ) { print $link."\n"; };
# quit the script
exit(0);
The only problem may be that the Technorati site is a rather busy one.
When their server is busy, the script marks the URL in question as '0' instead of n/a.
Is there a way to overcome this problem?
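One possible way to tell "the server did not answer" apart from "no links were found" is sketched below. It retries a failed fetch a few times and only reports 0 when a page actually arrived. Everything here is an assumption for illustration: the name count_or_na, the $fetch code reference (page text or undef, as LWP::Simple's get returns), the $tries and $delay parameters, and the stricter ([\d,]+) pattern, which is swapped in for the (.*) used in the scripts in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the link count for $site, 0 when the page
# arrived but shows no count, or 'n/a' when every fetch attempt failed.
sub count_or_na {
    my ($fetch, $site, $tries, $delay) = @_;
    for my $attempt (1 .. $tries) {
        my $page = $fetch->($site);
        if (defined $page) {
            # page fetched: a missing phrase really does mean no links
            return 0 unless $page =~ m/([\d,]+) sites link/;
            (my $count = $1) =~ s/,//g;   # "1,200" becomes "1200"
            return $count;
        }
        sleep $delay if $attempt < $tries;   # pause before retrying
    }
    return 'n/a';   # the count is unknown, not zero
}
```

Entries marked 'n/a' are not numbers, so they would need to be listed separately rather than passed through the numeric sort.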
I will be able to write this script for you but I am a little unsure as to what you require in one section.
You state "the script does not necessarily have to perform the same task each time the page loads", can you explain what you mean by this?
As I understand it, you require a script that searches Technorati for your 400 websites and then lists your sites in order of which ones are the most popular (have the most links). These results will be stored in a text file rather than a MySQL database.
Does the script have to be written in PHP? This kind of checking might be easier done in Perl/CGI.
Please let me know what you mean by the above phrase so I can begin work on this and whether a Perl script would also be acceptable.
You should be able to run the script on any operating system but you will probably need to install ActivePerl if you intend to run the script on a Windows machine. This is free software and can be downloaded from here: http://www.activestate.com/Products/ActivePerl/?mp=1
If you are running Windows and do not have ActivePerl installed (you probably will not have it if you have never done any Perl programming before), please start downloading it while you wait for my reply to your answer about which operating system you will run the script on.
The PHP rewrite will therefore involve more time than your offer of the $20 tip would allow. I had already estimated this would take a couple of hours for the $20 :( I hope you understand.
On my computer, I have Windows.
Could you please try the script yourself first, and send me a final version?
Thanks!
1- As for clarification about what I meant with the phrase "the script does not necessarily have to perform the same task each time the page loads":
I will publish the results on a web page, and I thought it would not be ideal for the script to fetch the latest results from Technorati each time the page loads.
Therefore I thought maybe I should run the script offline myself every, shall we say, 10 days, and manually publish the top 20 results on a static page so that the page loads faster.
I wanted to let you know about this concern of mine beforehand - just considering the possibility that it might affect the way you write the script. (if it doesn't, simply ignore it!)
2- PHP would be great. But if you think that you can do it with perl in a better way, and that it will work with no problems, then I can say yes to perl too.
I hope these replies have answered your question. Feel free to let me know if you need any further details.
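For the plan in point 1 of publishing only the top 20 on a static page, a tiny sketch of trimming an already sorted list may help. The name top_n is hypothetical and not part of any script in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first $n entries of an already sorted
# list, or the whole list when it holds fewer than $n entries.
sub top_n {
    my ($n, @sorted) = @_;
    $n = @sorted if $n > @sorted;
    return @sorted[0 .. $n - 1];
}
```

Calling top_n(20, @urls) on the sorted results would then give the lines to paste into the static page.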
You are correct in your assumption that the other 3 had no links, this is because of this line in the script:
$page =~ m/(.*) sites link/ or next;
This line says: check the contents of the Technorati page and search for a certain pattern. If the pattern is matched, continue with the rest of the while loop; otherwise move on to the next URL.
If you wish to double check the script I would recommend using urls that you know will bring up a solution (I used www.google.com, www.ebay.com and www.yahoo.com).
If you wish me to alter the script slightly so that it outputs "0 url" when no sites are linking, please let me know.
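The matching step described above can be tried on its own, without contacting Technorati, with a sketch like the following. The name link_count is hypothetical, and the pattern is a deliberate swap: ([\d,]+) is used instead of the (.*) from the script so that only digits and commas are captured. The "N sites link" wording is taken from the pattern in this thread and is otherwise an assumption about the page.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the number in front of "sites link" in
# $page, or 0 when that phrase does not appear.
sub link_count {
    my ($page) = @_;
    if ($page =~ m/([\d,]+) sites link/) {
        (my $count = $1) =~ s/,//g;   # "1,234" becomes "1234"
        return $count + 0;
    }
    return 0;
}

print link_count('<b>1,234 sites link here</b>'), "\n";   # prints 1234
print link_count('nothing useful on this page'), "\n";    # prints 0
```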
[This will be my last opportunity to respond to any clarifications tonight. I will respond to any further ones you have in the morning.]
#!/usr/bin/perl
# set up required modules and stuff
use strict;
use warnings;
use LWP::Simple;
# if a parameter is not passed to the script then stop
die "parameter missing" if @ARGV != 1;
# read the urls to check
open (URLS, $ARGV[0]) or die "cannot open ".$ARGV[0];
# set up some holding variables
my $site;
my @urls;
# loop through the urls and process them
while ($site = <URLS>) {
    # remove line feeds
    chomp($site);
    # get the search page
    my $page = get 'http://technorati.com/search/'.$site;
    # check something is found
    defined($page) or die 'cannot fetch '.$site;
    # match the number of sites linking to your url
    if ( $page =~ m/(.*) sites link/ ) {
        # add it to a url in the format:
        # number of links url
        my $links = $1;
        $links =~ s/,//g;
        push @urls, "$links $site";
    }
    else { push @urls, "0 $site"; };
}
# sort the urls: highest link count first, ties broken alphabetically
@urls = sort {
    ($b =~ /(\d+)/)[0] <=> ($a =~ /(\d+)/)[0]
    ||
    uc($a) cmp uc($b)
} @urls;
# print out the sorted list
foreach my $link ( @urls ) { print $link."\n"; };
If you specifically want me to answer this question also you can put "For palitoy-ga" as the question title.
Step 10 should probably be this:
c:\perl\perl\bin\perl.exe technorati.pl site2.txt >output.txt
You need to locate where the perl.exe file is in your Perl installation. As you have installed Perl at C:\perl\perl, it should be in the bin folder mentioned above.
You may wish to add Perl to your system path environment variable; there are easy-to-follow instructions here: http://www.peacefire.org/circumventor/adding-perl-to-path-variable.html
Again, you will need to alter c:\perl to c:\perl\perl
This means in future you would only need to type:
perl technorati.pl site2.txt >output.txt
Let me know how you get on with this...
No problem, I will add this to the script for you and post the new script here once I have completed this addition.
2- If you can add the sorting function, looks like we will be done.
This is already included :) Or was there a problem when you were testing it? It appeared to work correctly when I tested it...
3- Is it possible for you to write a similar PHP script too?
I will work on this for you this morning and should hopefully have a working solution in a few hours (as it will take this amount of time to write and test).
132 links www.sitenumber1.com
127 links www.sitenumber2.com
117 links www.sitenumber3.com
...and so on.
So it looks like you got me right.
However, I do not know anything about command lines or how to run/execute this script.
I have a site.txt file with a list of the URLs.
I also have the technorati.pl file with your script in it.
I do not know what to do next.
Please clarify.
Thanks.
2- If you can add the sorting function, looks like we will be done.
3- Is it possible for you to write a similar PHP script too? I can tip $20 if it is.
The following script alters it so that the "Use of uninitialized value in pattern match" warning is no longer displayed and only an error stating "cannot fetch xyz.com" is shown.
#!/usr/bin/perl
# set up required modules and stuff
use strict;
use warnings;
use LWP::Simple;
# if a parameter is not passed to the script then stop
die "parameter missing" if @ARGV != 1;
# read the urls to check
open (URLS, $ARGV[0]) or die "cannot open ".$ARGV[0];
# set up some holding variables
my $site;
my @urls;
# loop through the urls and process them
while ($site = <URLS>) {
    # remove line feeds
    chomp($site);
    # get the search page
    my $page = get 'http://technorati.com/search/'.$site;
    # check something is found
    if ( defined($page) ) {
        # match the number of sites linking to your url
        if ( $page =~ m/(.*) sites link/ ) {
            # add it to a url in the format:
            # number of links url
            my $links = $1;
            $links =~ s/,//g;
            push @urls, "$links $site";
        }
        else { push @urls, "0 $site"; };
    }
    else { print 'cannot fetch '.$site."\n"; };
}
# sort the urls: highest link count first, ties broken alphabetically
@urls = sort {
    ($b =~ /(\d+)/)[0] <=> ($a =~ /(\d+)/)[0]
    ||
    uc($a) cmp uc($b)
} @urls;
# print out the sorted list
foreach my $link ( @urls ) { print $link."\n"; };
Here are the steps you require:
1) Double-click on My Computer and then C: (your main hard disk).
2) Create a folder in C: with a name of your choice (I will call it urlupdate).
3) Copy the Perl script and the text file containing your URLs to C:\urlupdate (the folder you made in step 2 above).
4) Go to the ActivePerl website and download the free software:
http://www.activestate.com/Products/ActivePerl/?mp=1
5) Install the software by clicking on the program you downloaded and following the instructions.
6) This will install Perl on your system.
7) Go to Start->All Programs->Accessories and choose Command Prompt (alternatively Start->Run and type "command" in the window that appears).
8) A new, mainly black window should appear. This is the command prompt. Type:
cd C:\urlupdate
9) You are using DOS commands, and this one has moved you into the urlupdate folder you made in step 2.
10) Now type:
perl nameofperlscript.pl nameoftextfile.txt >nameofyourchoicefortheoutput.txt
11) This should start running the script (I would initially ensure you only have a few URLs in the text file, just to make sure it is working!). It will take some time before the process completes (it took about 20 seconds on my PC to do 3 URLs earlier when I was writing the program). You will know it has completed when you can type something else into the command prompt.
I know this must seem quite daunting but I am here to help you through each step. If you get any error messages please ask for clarification, state the step you got to, the error message and I will do my best to respond swiftly (I should be here for another 2 hours today and all day tomorrow).
Can you do both for a tip of $40?
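#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
die "parameter missing" if @ARGV != 1;
open (URLS, $ARGV[0]) or die "cannot open ".$ARGV[0];
# NOTE: the HTML tags, the loop and the match below were lost from the
# original post and are reconstructed here as assumptions
print "<table>\n";
while (my $site = <URLS>) {
    chomp($site);
    my $page = get 'http://technorati.com/search/'.$site;
    defined($page) or die 'cannot fetch '.$site;
    # pattern borrowed from the later script in this thread
    $page =~ m/(.*) sites link/ or next;
    print "<tr><td>$site</td><td>$1</td></tr>\n";
}
print "</table>\n";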
You run it with a single parameter, being the name of a text file that lists the sites that we want to check (one address per line). The file could look like this:
www.site1.com
www.site2.com
www.site3.com
...
The script fetches the information from Technorati and prints a simple HTML on standard output. You could use it like so (assuming you saved the script in a file called technolinks.pl and the list of sites is in a file called sites.txt and that you're on some kind of Linux/Unix):
./technolinks.pl sites.txt > output.html
You get your report in the file output.html that is ready to be served via a web server. (Of course it could use some nice formatting.) There is no dynamic script running every time someone wants to view the report, you manually (or mechanically via cron etc.) update the report by running the Perl script.
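As a hedged sketch of the "nice formatting" mentioned above, the report could be built by a small function that also escapes characters that are special in HTML. The name html_report and the [site, count] row layout are assumptions made for this example, not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: turns a list of [site, count] pairs into a small
# HTML table, one row per site, in the order given.
sub html_report {
    my @rows = @_;
    my $html = "<table>\n<tr><th>Site</th><th>Links</th></tr>\n";
    for my $row (@rows) {
        my ($site, $count) = @$row;
        # escape the characters that are special in HTML text
        for ($site, $count) {
            s/&/&amp;/g;
            s/</&lt;/g;
            s/>/&gt;/g;
        }
        $html .= "<tr><td>$site</td><td>$count</td></tr>\n";
    }
    return $html . "</table>\n";
}
```

The returned string can be printed to standard output and redirected to output.html in the same way as before.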