Tag Archives: HTML

PHP: How to get all images from an HTML page

rgbstock.com
I was curious to how I could make something similar to what Facebook does when you add a link. Somehow it loads images found on the page your link leads to, and then it presents them to you so you can select one you want to use as a thumbnail.

Well, step one to solve this is of course to find all the images on a page, and that is what I will present in this post. It will be sort of like a backend service we can use later from an AJAX call. You post it a URL, and you get all the image URLs it found back. Let’s put the petal to medal!

Continue reading

PHP: Dealing with absolute and relative URLs

I’m currently writing a post on how to get image tags from a remote HTML page using PHP. One sticky issue with that is that the image URLs you find is a joyful mix of absolute and relative URLs.

Luckily, I found a function on nashruddin.com which seem to handle them alright. After a bit of clean up and fixing an error, we have this function:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
<?php
function make_absolute($url, $base)
{
    // Return base if no url
    if( ! $url) return $base;

    // Return if already absolute URL
    if(parse_url($url, PHP_URL_SCHEME) != '') return $url;
   
    // Urls only containing query or anchor
    if($url[0] == '#' || $url[0] == '?') return $base.$url;
   
    // Parse base URL and convert to local variables: $scheme, $host, $path
    extract(parse_url($base));

    // If no path, use /
    if( ! isset($path)) $path = '/';
 
    // Remove non-directory element from path
    $path = preg_replace('#/[^/]*$#', '', $path);
 
    // Destroy path if relative url points to root
    if($url[0] == '/') $path = '';
   
    // Dirty absolute URL
    $abs = "$host$path/$url";
 
    // Replace '//' or '/./' or '/foo/../' with '/'
    $re = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#');
    for($n = 1; $n > 0; $abs = preg_replace($re, '/', $abs, -1, $n)) {}
   
    // Absolute URL is ready!
    return $scheme.'://'.$abs;
}

I can sort of read through and see what it does, but I can’t explain it very well. So, I’ll just leave it at that. So far it has worked fine for me. Maybe some corner cases that are missing, and if there are, please let me know!

:idea: What I added to the original function was line 5 and 17. The first to prevent it from crashing if the url is null or empty, and the second to prevent it from crashing if parse_url finds no path. For example if the url is http://www.example.com (No /whatever at the end).

The base tag

A tag that is easy to forget about is the base tag. The above function gets the base path from the URL given as base. For example if you gave it http://www.example.com/directory/file.html as base, it would use http://www.example.com/directory/. However, if file.html included the following base tag:

<base href="http://www.example.com/">

Then the base path would be http://www.example.com/ instead. Fun, eh?

As long as you know about it, it’s not to hard to deal with though. You just need to get a hold of it and provide that as base instead when using the function above.

Works On My Machineā„¢! And if it doesn’t on yours, let me know. If it’s a mistake in the function, I’d like to fix it!