PHP: How to get all images from an HTML page

I was curious to how I could make something similar to what Facebook does when you add a link. Somehow it loads images found on the page your link leads to, and then it presents them to you so you can select one you want to use as a thumbnail.

Well, step one to solve this is of course to find all the images on a page, and that is what I will present in this post. It will be sort of like a backend service we can use later from an AJAX call. You post it a URL, and you get all the image URLs it found back. Let's put the petal to medal!

Getting the URL

We'll use POST for this, and it shouldn't require much explanation.

$url = array_key_exists('url', $_POST) ? $_POST['url'] : null;

Loading the HTML

To load the HTML we'll use the handy cURL library, which I've used in earlier posts as well.

$request = curl_init();
curl_setopt_array($request, array
(
    CURLOPT_URL => $url,

    CURLOPT_RETURNTRANSFER => TRUE,
    CURLOPT_HEADER => FALSE,

    CURLOPT_SSL_VERIFYPEER => TRUE,
    CURLOPT_CAINFO => 'cacert.pem',

    CURLOPT_FOLLOWLOCATION => TRUE,
    CURLOPT_MAXREDIRS => 10,
));
$response = curl_exec($request);
curl_close($request);

We just create and execute a request for the supplied URL. The response is stored in a variable so we can use it later.

💡 We won't bother with any fancy error handling or anywhere else. If it doesn't work, we'll just give an empty list back. Bad URLs, faulty HTML, not our problem 😛

💡 The SSL and cacert.pem stuff is explained in an earlier post.

Parsing the HTML

You might have seen examples on how to find things in HTML using regular expressions. This is by most experienced developers regarded as A Bad Idea™. What we can, and probably should, use instead is the DOMDocument class you find in PHP. This class can parse XML and HTML into a neat DOM which is a lot easier to work with.

$document = new DOMDocument();
if($response)
{
    libxml_use_internal_errors(true);
    $document->loadHTML($response);
    libxml_clear_errors();
}

Not everyone writes perfectly formed HTML without errors so we load the HTML, the DOMDocument class will do its best to plow through the HTML and figure it all out. When it does come across things that are a bit out of wack, it will let us know by spitting out warnings. There's a big chance it managed to deal with it fine, but just for our information. However, like I mentioned earlier, we don't care about errors here and we definitely don't want those warnings messing up our output.

So, what we do here is to tell libxml (which is used internally) to enable user error handling instead. When we now load the HTML, those errors and warnings are then instead collected quietly. We can get to them afterwards by calling libxml_get_errors, but since we don't care about them at all we just clear them out instead. Easy peasy.

Dealing with relative URLs

As you perhaps know you can have both absolute and relative URLs in HTML. The relative URLs are relative to the base path of the HTML page. Unless the HTML page has a base tag. In that case we need to use whatever that specifies instead. So, how do we deal with all of that?

Well, I decided to stick that in a different post. What's important here is that we can get the base tag from the DOM like this:

$tags = $document->getElementsByTagName('base');

foreach($tags as $tag)
    return $tag->getAttribute('href');

Of course if there is a base tag, it should only be one, but since we get a collection back from getElementsByTagName I just use foreach for simplicity.

Next up we have a function which turn a relative URL into an absolute one. The signature looks like the following and the content you can read more about in that earlier mentioned post.

private static function make_absolute($url, $base)
{
    // "Magic"
}

Getting the images

Now to the fun part. With the HTML loaded up and the base path figured out, we just need to fetch the images.

$images = array();

foreach($document->getElementsByTagName('img') as $img)
{
    // Extract what we want
    $image = array
    (
        'src' => self::make_absolute($img->getAttribute('src'), $base),
    );

    // Skip images without src
    if( ! $image['src'])
        continue;

    // Add to collection. Use src as key to prevent duplicates.
    $images[$image['src']] = $image;
}
$images = array_values($images);

Now that was pretty simple, wasn't it? The perhaps weird thing here is that I first collect them in an array, using the image source as a key. This way we won't end up with duplicate image URLs. We also skip images which for some reason were to not have any URL.

We could of course also get other stuff from the image tag here, like the height, width, alt text or title, or we could do some more elaborate filtering to try to weed out uninteresting images. For example really tiny ones or whatever else we might think of.

Echo it all out

Now we just need to echo it all out in a format we can use easily on the client side. And we will, surprise, surprise, of course use our good friend JSON for this 🙂

header('content-type: application/json; charset=utf-8');
$result = array('images' => $images);
echo json_encode($result);

We set the appropriate header, wrap it up in a result array, encode it, and spit it out. And that's pretty much all there is to that part.

Complete code

Here's the complete code as a class, and usage of that class:

Class

<?php

/**
 * Finds images from a given URL.
 *
 * @author   Torleif Berger
 * @link     http://www.geekality.net/?p=1585
 * @license  http://creativecommons.org/licenses/by/3.0/
 */
class ImageFinder
{
    private $document;
    private $url;
    private $base;

    /**
     * Creates a new image finder object.
     */
    public function __construct($url)
    {
        // Store url
        $this->url = $url;
    }

    /**
     * Loads the HTML from the url if not already done.
     */
    public function load()
    {
        // Return if already loaded
        if($this->document)
            return;

        // Get the HTML document
        $this->document = self::get_document($this->url);

        // Get the base url
        $this->base = self::get_base($this->document);
        if( ! $this->base)
            $this->base = $this->url;
    }

    /**
     * Returns an array with all the images found.
     */
    public function get_images()
    {
        // Makes sure we're loaded
        $this->load();

        // Image collection array
        $images = array();

        // For all found img tags
        foreach($this->document->getElementsByTagName('img') as $img)
        {
            // Extract what we want
            $image = array
            (
                'src' => self::make_absolute($img->getAttribute('src'), $this->base),
            );

            // Skip images without src
            if( ! $image['src'])
                continue;

            // Add to collection. Use src as key to prevent duplicates.
            $images[$image['src']] = $image;
        }

        // Return values
        return array_values($images);
    }

    /**
     * Gets the html of a url and loads it up in a DOMDocument.
     */
    private static function get_document($url)
    {
        // Set up and execute a request for the HTML
        $request = curl_init();
        curl_setopt_array($request, array
        (
            CURLOPT_URL => $url,

            CURLOPT_RETURNTRANSFER => TRUE,
            CURLOPT_HEADER => FALSE,

            CURLOPT_SSL_VERIFYPEER => TRUE,
            CURLOPT_CAINFO => 'cacert.pem',

            CURLOPT_FOLLOWLOCATION => TRUE,
            CURLOPT_MAXREDIRS => 10,
        ));
        $response = curl_exec($request);
        curl_close($request);

        // Create DOM document
        $document = new DOMDocument();

        // Load response into document, if we got any
        if($response)
        {
            libxml_use_internal_errors(true);
            $document->loadHTML($response);
            libxml_clear_errors();
        }

        return $document;
    }

    /**
     * Tries to get the base tag href from the given document.
     */
    private static function get_base(DOMDocument $document)
    {
        $tags = $document->getElementsByTagName('base');

        foreach($tags as $tag)
            return $tag->getAttribute('href');

        return NULL;
    }

    /**
     * Makes sure a url is absolute.
     */
    private static function make_absolute($url, $base)
    {
        // Return base if no url
        if( ! $url) return $base;

        // Already absolute URL
        if(parse_url($url, PHP_URL_SCHEME) != '') return $url;

        // Only containing query or anchor
        if($url[0] == '#' || $url[0] == '?') return $base.$url;

        // Parse base URL and convert to local variables: $scheme, $host, $path
        extract(parse_url($base));

        // If no path, use /
        if( ! isset($path)) $path = '/';

        // Remove non-directory element from path
        $path = preg_replace('#/[^/]*$#', '', $path);

        // Destroy path if relative url points to root
        if($url[0] == '/') $path = '';

        // Dirty absolute URL
        $abs = "$host$path/$url";

        // Replace '//' or '/./' or '/foo/../' with '/'
        $re = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#');
        for($n = 1; $n > 0; $abs = preg_replace($re, '/', $abs, -1, $n)) {}

        // Absolute URL is ready!
        return $scheme.'://'.$abs;
    }
}

Usage

<?php

// Set content-type
header('content-type: application/json; charset=utf-8');


// Get the url
$url = array_key_exists('url', $_POST)
    ? $_POST['url']
    : null;


// Create image finder
include('image_finder.class.php');
$finder = new ImageFinder($url);


// Get images
$images = $finder->get_images();


// Output result
$result = array('images' => $images);
ob_start('ob_gzhandler');
echo json_encode($result);