Replace HTML Special Characters With Entities – But Without Touching Tags

I came across a problem while developing a CMS at work: I had to take a string of HTML source code and make sure all special HTML characters were replaced with their entities. For example, & (ampersand) should become &amp;.

PHP has a couple of useful functions for this sort of thing, namely htmlentities and htmlspecialchars. However, running my string through either of them was no good, because doing so would also convert the characters used in the HTML tags. For example, the following:

<p class="foo">This is a paragraph & that ampersand needs fixing</p>

Would become:

&lt;p class=&quot;foo&quot;&gt;This is a paragraph &amp; that ampersand needs fixing&lt;/p&gt;

The ampersand is converted nicely, but now the HTML is useless. My first thought was to parse the string using PHP’s XML parser in order to get at the character data directly, but that idea didn’t last long, since the very characters I was trying to fix would have broken the parser.
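
To illustrate why the parser route fails, here is a minimal sketch. It assumes the SimpleXML extension is available (the original post does not say which XML parser was considered), and it simply feeds the example paragraph to it:

```php
<?php
// Sketch only: SimpleXML stands in for "PHP's XML parser" here (an assumption).
libxml_use_internal_errors(true); // collect parse errors instead of emitting warnings

$html = '<p class="foo">This is a paragraph & that ampersand needs fixing</p>';
$xml  = simplexml_load_string($html);

$parsed = ($xml !== false);
$errors = libxml_get_errors();
libxml_clear_errors();

var_dump($parsed);            // bool(false): the bare ampersand is not well-formed XML
var_dump(count($errors) > 0); // bool(true)
```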

In the end I settled on using a regular expression to match the content in between tags but leave the tags themselves alone. I also added some functionality to leave anything between <? and ?> tags alone, so I could pass through HTML with embedded PHP and not have it break.
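
The "leave embedded PHP alone" half of that idea can be sketched in isolation. The sample string below is made up for illustration; the pattern is the one the function uses to split out PHP blocks:

```php
<?php
// Sketch of the splitting step. Because the pattern has a capture group and
// PREG_SPLIT_DELIM_CAPTURE is set, preg_split keeps each embedded PHP block
// as its own array element instead of discarding it.
$source = '<p>Tom & Jerry <?php echo "a & b"; ?> forever</p>';

$parts = preg_split('/(<\?.*?\?>)/us', $source, -1, PREG_SPLIT_DELIM_CAPTURE);

// Three elements: the markup before the PHP block, the PHP block itself,
// and the markup after it.
print_r($parts);
```

Only the elements that are not PHP blocks then need to go through the entity replacement.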

Here is the function. It is coded to work with UTF-8, hence the multibyte functions and the /u modifier on the regex, but if you are working with a single-byte character set you can just swap these out accordingly.

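<?php
function clean_entities($string) {

    // Decode already-encoded special characters (&amp;, &lt;, &gt;, &quot;)
    // so they are not encoded a second time below
    $string = htmlspecialchars_decode($string);

    // Split out embedded PHP blocks; the capture group keeps them in the result
    $parts = preg_split('/(<\?.*?\?>)/us', $string, -1, PREG_SPLIT_DELIM_CAPTURE);

    $string = '';

    foreach ($parts as $part) {
        if (false === mb_strpos(trim($part), '<?')) {
            // Encode runs of text that follow a ">" and do not start a tag
            $string .= preg_replace_callback(
                '/(?<=\>)((?![<](\?|\/)*[a-z][^>]*[>]).)+/ius',
                // Closure instead of create_function(), which was removed in PHP 8
                function ($matches) {
                    return htmlspecialchars($matches[0]);
                },
                $part
            );
        } else {
            // Embedded PHP is passed through untouched
            $string .= $part;
        }
    }

    return $string;

}
?>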

This results in nice valid entities, but the tags and any embedded PHP are left alone:

<p class="foo">This is a paragraph &amp; that ampersand <?php echo "has been" ?> fixed!</p>

7 comments

  • Claire Allen

    Sorry… I’m not really sure which parts I need to change to replace a word in the main content of a site that might also appear in an alt attribute?

  • Karl

    Hi Claire

    If you just want to do a simple replacement and don’t mind where the word might appear, you would be better off using PHP’s str_replace() function:

    $word = "foo";
    $replace = "bar";
    $string = '<img src="hello.png" alt="This is a nice foo image" /><p>This is a nice foo string</p>';
    echo str_replace($word, $replace, $string); // <img src="hello.png" alt="This is a nice bar image" /><p>This is a nice bar string</p>

  • Christian

    Hey karl … I thought this was really helpful. Would you be alright if I put a trackback to your blog on this topic? I am just now building a site http://www.redbonzai.com and I want to feature this on it.

    Christian

  • Karl

    Hi Christian,

    You’re very welcome to use anything you find here however you like.

    Karl

  • Christian

    Hey Karl …

    I am creating a new tutorial that is intended to create a shopping cart using CodeIgniter.
    Feel free to peruse it if you wish.
    http://www.redbonzai.com/blog/web-development/how-to-create-a-digital-shopping-cart-with-codeigniter/

    I am currently doing quite a bit with wordpress also, like creating e-commerce applications inside, and custom CMS backends.

    If you would like to collaborate, or exchange ideas you are more than welcome.

    Thanks for your time.

    Christian

  • rene

    Hey Karl,

    Really nice function! It works well in most cases, but currently the function escapes a bit too much if you have content like the following:

    &#8594;

    In this case it shouldn’t escape the ampersand but it does. Unfortunately, I don’t know how to add this special case easily to your regular expression.

    best,
    rené

  • Karl

    Hi rené,

    Hmm it should be able to handle that. I’ll take a look and post an update when I figure out what’s going on. Thanks for letting me know!

    Karl
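
One possible way to handle the numeric-entity case raised above, offered only as a sketch and not as part of the original function: htmlspecialchars accepts a fourth argument, double_encode, which tells it to leave existing entities alone when set to false. The sample text is made up for illustration.

```php
<?php
$text = 'Next &#8594; Tom & Jerry';

// Default behaviour: every ampersand is encoded, including the one that
// already starts a numeric entity.
$default = htmlspecialchars($text, ENT_COMPAT, 'UTF-8');

// double_encode = false: existing entities are left alone, while bare
// ampersands are still encoded.
$preserved = htmlspecialchars($text, ENT_COMPAT, 'UTF-8', false);

echo $default, "\n";   // Next &amp;#8594; Tom &amp; Jerry
echo $preserved, "\n"; // Next &#8594; Tom &amp; Jerry
```

Inside clean_entities this would presumably mean passing the same arguments in the callback; the initial htmlspecialchars_decode call might then be unnecessary, though that has not been checked against the full function here.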
