PHP: How to get all images from an HTML page

I was curious as to how I could make something similar to what Facebook does when you add a link. Somehow it loads images found on the page your link leads to, and then presents them to you so you can select the one you want to use as a thumbnail.

Well, step one to solve this is of course to find all the images on a page, and that is what I will present in this post. It will be sort of like a backend service we can use later from an AJAX call. You post it a URL, and you get back all the image URLs it found. Let’s put the pedal to the metal!

Getting the URL

We’ll use POST for this, and it shouldn’t require much explanation.

$url = array_key_exists('url', $_POST)
    ? $_POST['url']
    : null;
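
If you want to be a tiny bit defensive about it, you could toss in a quick sanity check with filter_var before going any further. This isn't something the sample does; cURL will simply fail on garbage anyway, and we handle that as an empty result.

// Optional sketch, not in the sample: discard anything that doesn't look like a URL up front
if($url !== null && ! filter_var($url, FILTER_VALIDATE_URL))
    $url = null;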

Loading the HTML

To load the HTML we’ll use the handy cURL library, which I’ve used in earlier posts as well.

$request = curl_init();
curl_setopt_array($request, array
(
    CURLOPT_URL => $url,
   
    CURLOPT_RETURNTRANSFER => TRUE,
    CURLOPT_HEADER => FALSE,
   
    CURLOPT_SSL_VERIFYPEER => TRUE,
    CURLOPT_CAINFO => 'cacert.pem',

    CURLOPT_FOLLOWLOCATION => TRUE,
    CURLOPT_MAXREDIRS => 10,
));
$response = curl_exec($request);
curl_close($request);

We just create and execute a request for the supplied URL. The response is stored in a variable so we can use it later.

:idea: We won’t bother with any fancy error handling here or anywhere else. If it doesn’t work, we’ll just give an empty list back. Bad URLs, faulty HTML; not our problem :P

:idea: The SSL and cacert.pem stuff is explained in an earlier post.

Parsing the HTML

You might have seen examples of how to find things in HTML using regular expressions. Most experienced developers regard this as A Bad Idea™. What we can, and probably should, use instead is the DOMDocument class that comes with PHP. This class can parse XML and HTML into a neat DOM, which is a lot easier to work with.

$document = new DOMDocument();
if($response)
{
    libxml_use_internal_errors(true);
    $document->loadHTML($response);
    libxml_clear_errors();
}

Not everyone writes perfectly formed HTML without errors, so when we load it, the DOMDocument class will do its best to plow through the HTML and figure it all out. When it does come across things that are a bit out of whack, it will let us know by spitting out warnings. There’s a big chance it managed to deal with them just fine; the warnings are just for our information. However, like I mentioned earlier, we don’t care about errors here, and we definitely don’t want those warnings messing up our output.

So, what we do here is tell libxml (which DOMDocument uses internally) to enable user error handling instead. When we then load the HTML, those errors and warnings are collected quietly instead of being printed. We could get to them afterwards by calling libxml_get_errors, but since we don’t care about them at all, we just clear them out instead. Easy peasy.
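
If you did care about those collected warnings, getting at them is simple enough. Just a sketch to show the mechanism; the actual code clears them without looking:

// Sketch: inspecting the collected parse warnings instead of discarding them
libxml_use_internal_errors(true);
$document->loadHTML($response);
foreach(libxml_get_errors() as $error)
    error_log("Line {$error->line}: {$error->message}");
libxml_clear_errors();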

Dealing with relative URLs

As you perhaps know, you can have both absolute and relative URLs in HTML. The relative URLs are relative to the base path of the HTML page, unless the page has a base tag, in which case we need to use whatever that specifies instead. So, how do we deal with all of that?

Well, I decided to stick that in a different post. What’s important here is that we can get the base tag from the DOM like this:

$tags = $document->getElementsByTagName('base');

foreach($tags as $tag)
    return $tag->getAttribute('href');

Of course, if there is a base tag, there should only be one, but since we get a collection back from getElementsByTagName, I just use foreach for simplicity.
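
Wrapped up as a little helper, it could look something like this. The name and the fallback are just for illustration here, but the idea is the same: use the base tag if there is one, otherwise fall back to the URL the page itself was loaded from.

// Illustrative helper: prefer the base tag, fall back to the page URL
private static function get_base($document, $url)
{
    foreach($document->getElementsByTagName('base') as $tag)
        return $tag->getAttribute('href');
    return $url;
}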

Next up we have a function which turns a relative URL into an absolute one. The signature looks like the following, and the contents you can read more about in that earlier-mentioned post.

private static function make_absolute($url, $base)
{
    // "Magic"
}
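
For the impatient, here’s a rough approximation of what goes on inside. This is not the full version from that other post; just a sketch covering the common cases, and it skips things like ../ segments and query strings:

// Sketch only: resolves the common kinds of relative URLs
private static function make_absolute($url, $base)
{
    // Already absolute? Then there's nothing to do
    if(parse_url($url, PHP_URL_SCHEME))
        return $url;

    $parts = parse_url($base);

    // Protocol-relative, e.g. //example.com/foo.png
    if(substr($url, 0, 2) == '//')
        return $parts['scheme'].':'.$url;

    // Root-relative, e.g. /images/foo.png
    if(substr($url, 0, 1) == '/')
        return $parts['scheme'].'://'.$parts['host'].$url;

    // Plain relative, e.g. images/foo.png; resolve against the base path
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $parts['scheme'].'://'.$parts['host'].$dir.$url;
}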

Getting the images

Now to the fun part. With the HTML loaded up and the base path figured out, we just need to fetch the images.

$images = array();

foreach($document->getElementsByTagName('img') as $img)
{
    // Extract what we want
    $image = array
    (
        'src' => self::make_absolute($img->getAttribute('src'), $base),
    );
   
    // Skip images without src
    if( ! $image['src'])
        continue;

    // Add to collection. Use src as key to prevent duplicates.
    $images[$image['src']] = $image;
}
$images = array_values($images);

Now that was pretty simple, wasn’t it? The perhaps weird thing here is that I first collect the images in an array, using the image source as the key. This way we won’t end up with duplicate image URLs. We also skip images which for some reason don’t have any URL.

We could of course also get other stuff from the image tag here, like the height, width, alt text or title, or we could do some more elaborate filtering to try to weed out uninteresting images; for example, really tiny ones, or whatever else we might think of.
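
Grabbing the alt and title texts, for example, would just be a couple more lines in that array. A sketch, not part of the sample code:

// Sketch: pick up a bit more from each tag while we're at it
$image = array
(
    'src'   => self::make_absolute($img->getAttribute('src'), $base),
    'alt'   => $img->getAttribute('alt'),
    'title' => $img->getAttribute('title'),
);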

Echo it all out

Now we just need to echo it all out in a format we can use easily on the client side. And we will, surprise, surprise, of course use our good friend JSON for this :)

header('content-type: application/json; charset=utf-8');
$result = array('images' => $images);
echo json_encode($result);

We set the appropriate header, wrap it up in a result array, encode it, and spit it out. And that’s pretty much all there is to that part.
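
For a page with a couple of images, the output would look something like this (URLs made up, and pretty-printed here; json_encode puts it all on one line):

{
    "images": [
        {"src": "http://example.com/images/logo.png"},
        {"src": "http://example.com/images/photo.jpg"}
    ]
}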

Working sample

When I implemented this, I decided to wrap most of it up in a class I called Image_Finder. I have also created a tiny interface for it so you can test it out for yourself. I will go through that interface (or a variant of it) later, but since it’s all HTML and JavaScript, you are of course more than welcome to have a peek at its source :)

Either way, you can find it all, code and sample, at samples.geekality.net/image-fetcher. Let me know if you like it. And definitely let me know if you find any bugs ;)

Bugs, improvements, suggestions, praise; please leave a comment. I like to learn :)

  • Niels van Kerkhoven

    The program for fetching the images works great.
    However when I run it on my server it wants a file named get-images.php but I cannot find it on your site.
    Could you help me?

    • http://www.geekality.net Torleif

      Seems to be a left-over from while I was messing around with this. It should probably be changed to scan.php. Seems there was a slight mismatch between my local version and the uploaded one as well. I have re-uploaded it, so hopefully it’s error free now :p

      Thanks for the heads up!

      • Niels van Kerkhoven

        With the new version, Firefox and IE open scan.php for saving/opening (localhost version).

        • http://www.geekality.net Torleif

          Then I think you’re doing something wrong. It works here in all my browsers. I’m guessing you have forgotten the JavaScript file, or perhaps you have the path wrong? And that’s probably what was wrong before too, actually, now that I come to think about it…

  • Niels van Kerkhoven

    When adding the URL of my site, the reply is not shown.
    Is this correct?

    • http://www.geekality.net Torleif

      Depends on what your site is :P I tried using the domain of your email address and got 8 images. So if that’s your URL, then it should get some images :)

      If you have an online version of this I could try it and see if I find something obviously wrong.

  • Niels van Kerkhoven

    Could you please run the image fetcher from the URL I supplied in the previous post?
    Maybe it works for you.

    • http://www.geekality.net Torleif

      No, the scan.php file doesn’t return anything. So that’s where the error must be.

      By the way, when you copy things from others, don’t hot-link graphics and scripts, and definitely don’t leave tracking code. You have just copied the whole HTML of the index page, which is not good. Please remove the Clicky script code at the bottom of the page at least.

  • Niels van Kerkhoven

    Could it be that your scan file is different than the one on the website? Because that is the code I copied.
    BTW, all links are removed.

    • http://www.geekality.net Torleif

      The one in the sample is the one. The source code viewer is dynamic, so it’s the same as the contents of the actual file used. So, shouldn’t be any problems there.

      Thank you for removing the links :)

  • Graham

    Having similar problems with script execution. Think the version you uploaded may be faulty.

    • http://www.geekality.net Torleif

      Rather than thinking it is faulty, how about finding the fault? ;) How does it not work? Do you get an error message? When you view the source of the sample, you see the actual code that is running. And if what I uploaded was faulty, then the sample wouldn’t run. And it runs fine here on all my browsers, no problems at all.

      Check instead your script and your extensions. Does your web server support cURL for example?

  • Pinazinho

    If you add this code on line 70 of image_fetcher.class.php:

    $numargs = func_num_args();
    if($numargs == 2) {
        list($width, $height, $type, $attr) = getimagesize(@$image['src']);
        if($width <= func_get_arg(0) && $height <= func_get_arg(1))
            continue;
    }

    and change line 49:
    public function get_images()

    for this:
    public function get_images()

    • http://www.geekality.net/ Torleif Berger

      Yeah, you could probably add a lot of functionality and smartness here, but the main points were basic cURL usage and how to parse HTML without using regex :)

  • Óscar Palacios Ruiz

    Great coding. I had written a small method on my own, but it was really slow, and I have to deliver this today! I included your class in my code, but of course retained your credits. Thank you so very much.

    • http://www.geekality.net/ Torleif Berger

      Thanks! Fun to know the code is useful for others :)

  • Tommaso Cardone

    Thank you for this script :)

  • glurl

    Hi, good coding. I made some changes to suit my needs; I need the script to return only the URL links of a page. Basically, I got this by changing the occurrences of ‘src’ to ‘href’ and ‘img’ to ‘a’. However, I am having difficulty getting the name of the link to appear associated with its URL. What I did was the following:

    var image = $('')
        .prop(data.images[n])
        .html('test')
        .appendTo('#output')
        .load(imageLoaded);

    }

    How do I make every link appear on its own line, with its name instead of the word ‘test’? Thank you.

    • http://www.geekality.net/ Torleif Berger

      You need to get the content of the ‘a’ tag. Probably a property called innerHTML or something; read up on the documentation for DOMDocument on php.net for how to do that. Note that this might end up being a huge pain though, as links can contain text, images, and other stuff.

      • glurl

        Thank you!

  • chipi92

    Is this post alive? I cannot get curl_exec working; it is stuck there. What can it be? Any idea?

    • http://www.geekality.net/ Torleif Berger

      Well, the post has never been alive, but its author still is! No idea why it’s not working for you. You kind of need to debug it yourself. Google, PHP docs, etc. Use the curl_error function, check the return status code, etc. If all else fails, you can always post a question at StackOverflow.com. Good luck :)

  • chipi92

    It might be that Twitter is blocking requests from my free host provider.

    IMPORTANT!!

    The request returns a JSON-encoded variable.

    I cannot get data.images without decoding the variable returned from scan.php.

    In function getImagesFromUrlDone:

    $('#output').empty();
    data = jQuery.parseJSON(data);

    And then we can do data.images

  • Mike

    Hey, I want to display URLs of images on a page with PHP and I don’t know how. Can you help me?

    • http://www.geekality.net/ Torleif Berger

      If you can’t figure it out from the code above, then no. Try StackOverflow.com.

  • elijah

    This is an awesome plugin; one question: how would you go about using it in Laravel? I know it’s a noob question.
    Can this be made into a package and installed through Composer?