Improved web scraping

When you start web scraping, it's generally recommended to use proxies, both so your host doesn't receive misleading results and so you don't get banned from the source you're trying to scrape.

Web crawling isn't that hard to do, but it requires some extra steps to make sure you don't get bad results.

An example of a simple web crawler:

<?php 
# plain and simple web scraper
$str = file_get_contents('http://example.com');

echo $str;
?>
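
One caveat worth knowing: file_get_contents() returns false on failure, and by default it can hang for a long time on a slow server. Here is a minimal sketch of a safer fetch; the 10-second timeout is my own assumption, tune it to your needs.

<?php
// Same fetch, but with a request timeout and an explicit failure check.
$context = stream_context_create([
    'http' => ['timeout' => 10], // seconds; assumed value, not a recommendation
]);

$str = file_get_contents('http://example.com', false, $context);

if ($str === false)
{
    // The request failed: DNS error, timeout, connection refused, etc.
    exit('Failed to fetch the page.');
}

echo $str;
?>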

As I mentioned earlier, that alone isn't enough if you want a long-term solution. First of all, you'll have to find something constant in the website, such as a footer string or the closing </html> tag, so you can verify that you always fetched the full page.

My first suggestion when scraping a page is to add a retry loop and search each response for that constant you found in the website.

<?php
$retries = 4;

for ($i = 0; $i < $retries; $i++)
{
    $str = file_get_contents('http://example.com');

    // when the constant value was found in the page, break out of the for
    if ($str !== false && strpos($str, '<constant value>') !== false)
    {
        break;
    }
}

echo $str;
?>
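
If the marker never shows up, retrying immediately will usually fail for the same reason, so it's worth pausing between attempts. Here is a sketch of the same loop with a growing delay; the marker string is still a placeholder, and the 1-second base delay is my own assumption.

<?php
$retries = 4;
$delay = 1; // seconds to wait after a failed attempt (assumed value)

for ($i = 0; $i < $retries; $i++)
{
    $str = file_get_contents('http://example.com');

    // the constant value was found, so the page is complete: stop retrying
    if ($str !== false && strpos($str, '<constant value>') !== false)
    {
        break;
    }

    // back off before the next attempt instead of hammering the server
    sleep($delay);
    $delay *= 2;
}

echo $str;
?>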

With these examples we just wanted to demonstrate how easy basic web scraping is, along with a quick way to improve the results. In the next post we will add proxy support and cURL to our web crawler.