Link exchanges are commonplace on the internet these days. Developing linking partners is a great way to gain links to a new or existing website in an effort to increase its traffic and search engine performance. But what happens when someone agrees to link to you and then removes your link a few days, weeks, months, or years later? In this article, I am going to teach you how to write a PHP script that will automatically crawl your link partners’ web pages and check for links to your site.

Note: I am sure that scripts and programs already exist to do this for you (I haven’t searched for any), but this method is free and will help build your web programming skills.

The basic idea behind a link exchange is that you are giving a link with the intent of getting a link in return. Some sneaky webmasters might exchange links with you, only to turn around and remove or nofollow your link a few months down the road, hoping that you don’t notice. In most cases you won’t notice, because it is simply too much work to constantly check up on all of your link partners to make sure that they are still linking to you as agreed.

There is, however, a much easier way to check up on them. You can create a simple PHP script that does everything for you. I am going to teach you how this is done in three phases, each one building upon the principles of the previous one. In the end, you will have created a script that can count backlinks on multiple websites and flag nofollow links.

Phase 1: Checking for Your Domain Name

This first script is extremely simple. All it does is get the contents of another web page, check if your domain name is within the source, and print the output.

<?php
# define some variables
$needle = 'derekbeau.com';
$haystack = 'http://www.lurksteraz.com/index.php';

# get the remote page contents
$haystack_source = file_get_contents($haystack);

# check if our domain is on the page
# (compare against false, because strpos returns 0 for a match at the very start)
if (strpos($haystack_source, $needle) !== false) { print "SUCCESS: '".$needle."' is on '".$haystack."'<br />"; }
else { print "FAILURE: '".$needle."' is <strong>NOT</strong> on '".$haystack."'<br />"; }
?>

The first couple of lines define what we are searching for and where. The ‘$needle’ is your domain name. We search only for the part that every link must contain, because we do not know how other webmasters are going to link to us (with or without ‘www’, for example). The ‘$haystack’ variable is the web page that we are searching.

Next, we call the ‘file_get_contents’ function to download the HTML of the ‘$haystack’ page. We store this data as ‘$haystack_source’ and then use the ‘strpos’ function to find out whether our ‘$needle’ exists. If it does, we print a success message; if it does not, we print a failure message.

That’s it! The main problem with this script is that it will return a success even if the partner site just mentions your domain name as text. We need to make it better.
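To make that weakness concrete, here is a minimal sketch that runs the phase 1 check against two inline HTML strings instead of a live page. The sample markup and the function name ‘mentions_domain’ are made up for illustration; nothing is fetched from a real site.

```php
<?php
$needle = 'derekbeau.com';

// Hypothetical page sources, used in place of file_get_contents().
$linked_page    = '<p>Friends: <a href="http://www.derekbeau.com/">Derek</a></p>';
$text_only_page = '<p>I used to link to derekbeau.com but removed it.</p>';

// The same test as phase 1, wrapped in a function (name is illustrative).
function mentions_domain($html, $needle) {
    // strpos returns false when nothing is found and 0 when the match
    // starts at the first character, so compare against false explicitly.
    return strpos($html, $needle) !== false;
}

var_dump(mentions_domain($linked_page, $needle));    // bool(true)
var_dump(mentions_domain($text_only_page, $needle)); // bool(true), a false positive
```

Both pages pass, even though only the first one actually links to us.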

Phase 2: Counting Links and Finding Nofollows

Building on the first phase, we are now going to use regular expressions to search for actual links (unlinked text won’t count anymore). We are also going to count the number of links and check each one for the nofollow attribute.

<?php
# define some variables
$needle = 'derekbeau.com';
$haystack = 'http://www.lurksteraz.com/index.php';

# get the remote page contents
$haystack_source = file_get_contents($haystack);

# get our links
# (preg_quote escapes the dot in the domain; the slash in </a> is escaped
# because '/' is the pattern delimiter)
$pattern = '/<a[^>]*href=[^>]*'.preg_quote($needle, '/').'(.*?)<\/a>/i';
preg_match_all($pattern, $haystack_source, $matches);

# throw away extra data and count links
$matches = $matches[0];
$total_links = count($matches);

# search all links for nofollow
$nofollow = 0;
foreach ($matches as $match) {
    if (strpos($match, 'nofollow') !== false) { $nofollow++; }
}

# compare total links and nofollow to form a conclusion
if ($total_links > 0 && $nofollow < $total_links) {
    print "SUCCESS: ".$haystack." - ".$total_links." links, ".$nofollow." nofollowed";
}
else if ($total_links > 0 && $nofollow >= $total_links) {
    print "NOFOLLOW: ".$haystack." - ".$total_links." links, ".$nofollow." nofollowed";
}
else if ($total_links == 0) {
    print "FAILURE: ".$haystack." - ".$total_links." links, ".$nofollow." nofollowed";
}
else { print "UNKNOWN RESULT"; }
?>

Ok, so this time it is a bit longer than before, but we haven’t really added that much difficulty. The first difference that you will notice is the ‘$pattern’ variable. This is the regex pattern that we are searching for in the source code. I can’t really explain everything in there because regex is a completely different topic, but it basically searches for any ‘<a></a>’ link that contains your domain in the ‘href=’ attribute.

With that pattern defined, we then call the ‘preg_match_all’ function to get all of the matches for it in the HTML of the page we are checking. Each match is then stored in the ‘$matches’ array. After getting all of the links, we use a ‘foreach’ loop to check for ‘nofollow’ in each one individually, using ‘strpos’ just like in phase 1. If we find it, we increment the ‘$nofollow’ variable, which serves as a counter.
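If you want to see what ‘preg_match_all’ actually hands back, the sketch below runs the pattern against a small inline HTML string. The sample markup is invented for illustration, and I am assuming the domain is passed through ‘preg_quote’ so that the dot is matched literally.

```php
<?php
$needle = 'derekbeau.com';

// Hypothetical page source: one normal link, one nofollowed link,
// and one plain-text mention that should not be counted.
$html = '<p><a href="http://www.derekbeau.com/">Derek</a> and '
      . '<a rel="nofollow" href="http://derekbeau.com/blog">his blog</a>. '
      . 'Also see derekbeau.com (not linked).</p>';

// The '/' in </a> is escaped because '/' is the pattern delimiter.
$pattern = '/<a[^>]*href=[^>]*' . preg_quote($needle, '/') . '(.*?)<\/a>/i';
$count = preg_match_all($pattern, $html, $matches);

// $matches[0] holds the full text of each matching link;
// $matches[1] holds whatever the (.*?) group captured.
print $count . "\n";   // 2
print_r($matches[0]);
```

The plain-text mention is ignored, so only the two real links end up in ‘$matches[0]’.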

Note: There is one caveat to this ‘nofollow’ search. If that word appears anywhere within the link (alt, title, href, anchor text), the check will return positive. Since that would be a rare case in most “normal” niches, I am not worrying about it. To check specifically for ‘rel=nofollow’ you would have to use multiple ‘strpos’ calls or a regex search, because there are multiple ways that it can be written. I chose to use ‘strpos’ because it is the fastest option and uses the fewest server resources.
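For anyone who does want the stricter check, here is one possible regex-based sketch. The function name ‘link_is_nofollow’ is my own, and the pattern is an assumption about the common ways ‘rel’ gets written (double-quoted, single-quoted, unquoted, or alongside other values); it is not a full HTML parser and unusual markup can still fool it.

```php
<?php
// Returns true only when the opening <a> tag carries a rel attribute
// whose value contains the word "nofollow". Illustrative name.
function link_is_nofollow($link_html) {
    // Isolate the opening tag so the anchor text is ignored.
    if (!preg_match('/^<a\b[^>]*>/i', $link_html, $tag)) {
        return false;
    }
    // Accepts rel="nofollow", rel='external nofollow' and rel=nofollow.
    return preg_match('/\brel\s*=\s*(["\']?)[^"\'>]*\bnofollow\b/i', $tag[0]) === 1;
}

var_dump(link_is_nofollow('<a href="http://derekbeau.com/" rel="nofollow">Derek</a>'));
var_dump(link_is_nofollow('<a href="http://derekbeau.com/nofollow-guide">About nofollow</a>'));
```

The first call should report a nofollowed link, while the second should not, even though the word appears in both the URL and the anchor text.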

Now we have ‘$nofollow’ with the number of links that contain ‘nofollow’, and we have ‘$total_links’ with the total number of links we got from the page. The next step is to compare them to determine the result:

  • If total links is greater than 0 and the number of nofollow links is less than the total number of links, we have at least one good link (success).
  • If total links is greater than 0 but the number of nofollow links is equal to (or greater than, just in case) the total number of links, then we only have nofollow links on this page (semi-failure).
  • If the total number of links is 0, then we have no links on this page (failure).
  • If there is any other result, I don’t know what happened.
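The same decision can be pulled out into a small function, which makes it easy to try with different numbers. This is only a sketch; the name ‘classify_links’ and the idea of returning a status string are my own additions for illustration.

```php
<?php
// Maps the two counters to one of the three outcomes listed above.
// Illustrative helper; assumes both counts are non-negative integers.
function classify_links($total_links, $nofollow) {
    if ($total_links == 0) {
        return 'FAILURE';   // no links at all
    }
    if ($nofollow >= $total_links) {
        return 'NOFOLLOW';  // links exist, but every one is nofollowed
    }
    return 'SUCCESS';       // at least one followed link
}

print classify_links(3, 1) . "\n"; // SUCCESS
print classify_links(2, 2) . "\n"; // NOFOLLOW
print classify_links(0, 0) . "\n"; // FAILURE
```

Written this way, the three cases cover every non-negative pair of counts, so the “unknown” branch never fires.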

This script has much more functionality than our first version. Instead of simply checking for our domain name on the page, we are counting all instances of actual links and checking if they have nofollow attached to them. The next step is to add a user interface and the ability to check more than one page at a time.

Phase 3: User Interface and Multiple Pages

To complete this next phase, we are really only adding an HTML form and doing some simple modifications to the programming. Here is the result:
<?php
if (isset($_POST['submit'])) {

    # define some variables
    $needle = trim($_POST['needle']);
    $partners = explode("\n", trim($_POST['partners']));

    foreach ($partners as $haystack) {
        # clean up the line (this also removes a trailing "\r") and skip blanks
        $haystack = trim($haystack);
        if ($haystack == '') { continue; }

        # get the remote page contents
        $haystack_source = file_get_contents($haystack);

        ## get our links
        $pattern = '/<a[^>]*href=[^>]*'.preg_quote($needle, '/').'(.*?)<\/a>/i';
        preg_match_all($pattern, $haystack_source, $matches);

        ## throw away extra data
        $matches = $matches[0];
        $total_links = count($matches);

        ## go through all links
        $nofollow = 0;
        foreach ($matches as $match) {
            if (strpos($match, 'nofollow') !== false) {
                $nofollow++;
            }
        }

        if ($total_links > 0 && $nofollow < $total_links) {
            print "SUCCESS: ".$haystack." - ".$total_links." links, ".$nofollow." nofollowed<br />";
        }
        else if ($total_links > 0 && $nofollow >= $total_links) {
            print "NOFOLLOW: ".$haystack." - ".$total_links." links, ".$nofollow." nofollowed<br />";
        }
        else if ($total_links == 0) {
            print "FAILURE: ".$haystack." - ".$total_links." links, ".$nofollow." nofollowed<br />";
        }
        else {
            print "UNKNOWN RESULT<br />";
        }
    }
}
?>
<strong></strong>
<form method="post">
<input type="text" name="needle" /><br />
<textarea name="partners"></textarea><br />
<input type="submit" name="submit" value="Submit" />
</form>

We start by running the code only if the form was submitted. Then we get our list of partner pages (one per line) and turn it into an array. Next, we use ‘foreach’ again to loop through each of the partner pages and run essentially the same code from phase 2 each time.
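Turning the textarea into an array is the one genuinely new step here, so it is worth a closer look. The sketch below is one way to do it a little more defensively; the function name ‘parse_partner_list’ and the example.com-style URLs are placeholders of mine, not part of the script above.

```php
<?php
// Splits raw textarea input into a clean list of URLs.
// Illustrative helper; browsers usually submit line breaks as "\r\n",
// so we split on any of the three line-ending styles.
function parse_partner_list($raw) {
    $lines = preg_split('/\r\n|\r|\n/', $raw);
    $lines = array_map('trim', $lines);
    // Drop blank lines; array_values renumbers the keys from 0.
    return array_values(array_filter($lines, 'strlen'));
}

// Placeholder input with a blank line and stray spaces.
$raw = "http://a.example/\r\n\r\n  http://b.example/links.html  \n";
print_r(parse_partner_list($raw));
```

With that input, the result should contain just the two URLs, trimmed and in their original order.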

I have uploaded this final phase (with error checking and CSS styling) so that you can check on sites that are linking to you.

What Else Could You Do?

This script will work for most purposes as it is, but there are plenty of other things you could add to it. You could store the list of partner pages directly in the file and run the script via an automatic cron job, having it email you if any links have gone missing. Or, you could incorporate AJAX techniques to check any number of pages without worrying about the script timing out.
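As a rough starting point for the cron idea, here is a sketch of how the checking logic might be separated from the fetching and emailing. Everything named here (‘count_backlinks’, ‘find_missing_links’, the email address) is hypothetical, and the part that touches the network or sends mail is left commented out.

```php
<?php
// Counts links to $needle in $html, using the same pattern as phase 2.
function count_backlinks($html, $needle) {
    $pattern = '/<a[^>]*href=[^>]*' . preg_quote($needle, '/') . '(.*?)<\/a>/i';
    return (int) preg_match_all($pattern, $html, $matches);
}

// Returns the partner URLs that have no link to $needle. $fetch is any
// callable that takes a URL and returns the page source, or false on failure.
function find_missing_links($needle, array $partners, $fetch) {
    $missing = array();
    foreach ($partners as $url) {
        $html = call_user_func($fetch, $url);
        if ($html === false || count_backlinks($html, $needle) === 0) {
            $missing[] = $url;
        }
    }
    return $missing;
}

// Possible cron usage (commented out so nothing is fetched or mailed by accident):
// $partners = array('http://www.lurksteraz.com/index.php');
// $missing = find_missing_links('derekbeau.com', $partners, 'file_get_contents');
// if ($missing) {
//     mail('you@example.com', 'Missing backlinks', implode("\n", $missing));
// }
```

Passing the fetcher in as a parameter is a design choice of mine: it lets you try the logic with canned HTML before pointing it at real sites.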

If you wanted to get really advanced, you could write a script that crawls an entire website while looking for links. However, that would be very server intensive and probably impractical.

I hope this post taught you a little about spidering web pages with PHP. Feel free to copy and paste the source code to a PHP file on your own server or just test it out on mine.
