Replace HTML Special Characters With Entities - But Without Touching Tags

Posted in PHP by Karl on June 5, 2009

I came across a problem during the development of a CMS at work: I had to take a string of HTML source code and make sure all special HTML characters were replaced with their entities. For example, & (ampersand) should become &amp;amp;.

PHP has a couple of useful functions for this sort of thing, namely htmlentities and htmlspecialchars. However, running my string through either of these was no good to me, because doing so would convert the characters used in the HTML tags too. For example, the following:

<p class="foo">This is a paragraph & that ampersand needs fixing</p>

Would become:

&lt;p class=&quot;foo&quot;&gt;This is a paragraph &amp; that ampersand needs fixing&lt;/p&gt;
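For anyone who wants to reproduce the conversion, here is a minimal sketch. The variable names are only for illustration, and the explicit ENT_QUOTES and 'UTF-8' arguments are my assumption about sensible settings rather than what the CMS used.

```php
<?php
// Illustrative input, copied from the example above.
$html = '<p class="foo">This is a paragraph & that ampersand needs fixing</p>';

// Both functions escape &, <, > and quotes wherever they occur,
// so the markup is encoded along with the text.
$viaSpecialChars = htmlspecialchars($html, ENT_QUOTES, 'UTF-8');
$viaEntities     = htmlentities($html, ENT_QUOTES, 'UTF-8');

echo $viaSpecialChars, "\n";

// For plain ASCII input like this, the two results are identical.
var_dump($viaSpecialChars === $viaEntities);
```

htmlentities only differs from htmlspecialchars when the input contains characters that have named entities of their own, such as accented letters.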

The ampersand is converted nicely, but now the HTML is useless. My first thought was to parse the string using PHP’s XML parser in order to get at the character data directly, but that idea didn’t last long, since the very characters I was trying to fix would have broken the parser.
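The parser problem is easy to see in a small experiment. This sketch assumes PHP's xml extension is available, and is_well_formed_xml is a hypothetical helper written just for this illustration.

```php
<?php
// Hypothetical helper: reports whether a fragment parses as well-formed XML.
function is_well_formed_xml(string $fragment): bool
{
    $parser = xml_parser_create('UTF-8');
    // The final argument marks this as the last (and only) chunk of input.
    return xml_parse($parser, $fragment, true) === 1;
}

$raw   = '<p class="foo">This is a paragraph & that ampersand needs fixing</p>';
$fixed = '<p class="foo">This is a paragraph &amp; that ampersand needs fixing</p>';

var_dump(is_well_formed_xml($raw));   // the bare ampersand is a well-formedness error
var_dump(is_well_formed_xml($fixed)); // the escaped version parses
```

In other words, the parser can only be used on input that has already been cleaned, which is no help here.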

In the end I settled on using a regular expression to match the content between tags while leaving the tags themselves alone. I also added some functionality to leave anything between <? and ?> tags alone, so I could pass through HTML with embedded PHP and not have it break.
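The second part, setting embedded PHP aside before any escaping happens, can be sketched on its own with preg_split. The input string and the labels printed here are made up for the illustration.

```php
<?php
// Illustrative input: HTML with one embedded PHP block.
$source = '<p>Fish & chips <?php echo "a & b"; ?> today</p>';

// The capture group plus PREG_SPLIT_DELIM_CAPTURE keeps each PHP block
// in the result as its own element, so it can be passed through untouched.
// The /s modifier lets a block span several lines.
$parts = preg_split('/(<\?.*?\?>)/us', $source, -1, PREG_SPLIT_DELIM_CAPTURE);

foreach ($parts as $part) {
    // Label each piece so the split is easy to read in the output.
    $kind = (strpos($part, '<?') === 0) ? 'php ' : 'html';
    echo $kind, ': ', $part, "\n";
}
```

Only the pieces labelled html would then go on to the escaping step; the php piece is appended back exactly as it was.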

Here is the function. It is coded to work with UTF-8, hence the multibyte functions and the /u modifier on the regex, but if you are working with a single-byte character set you can swap mb_strpos for strpos and drop the /u modifier.

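<?php
/**
 * Escape HTML special characters in text content only, leaving tags
 * and embedded PHP blocks untouched. Expects UTF-8 input.
 */
function clean_entities($string) {

    // Decode first so entities that are already present are not double-encoded.
    $string = htmlspecialchars_decode($string);

    // Split out embedded PHP blocks; the capture group keeps them in the result.
    $parts = preg_split('/(<\?.*?\?>)/us', $string, -1, PREG_SPLIT_DELIM_CAPTURE);

    $string = '';

    foreach ($parts as $part) {
        if (false === mb_strpos(trim($part), '<?')) {
            // Match each run of text that follows a ">" and stops before the next tag.
            $string .= preg_replace_callback(
                '/(?<=\>)((?![<](\?|\/)*[a-z][^>]*[>]).)+/ius',
                // Anonymous function callback (requires PHP 5.3 or later).
                function ($matches) {
                    return htmlspecialchars($matches[0]);
                },
                $part
            );
        } else {
            // Embedded PHP: pass through unchanged.
            $string .= $part;
        }
    }

    return $string;

}
?>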

This results in nice valid entities, but the tags and any embedded php are left alone:

<p class="foo">This is a paragraph &amp; that ampersand <?php echo "has been" ?> fixed!</p>
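One case worth checking is input that already contains entities. htmlspecialchars_decode only decodes the handful of entities that htmlspecialchars itself produces, so a named entity such as &amp;rarr; survives the decode step and its ampersand gets escaped again. A possible workaround, which is not part of the function above, is the fourth argument of htmlspecialchars (double_encode). The sketch below only shows the difference on a made-up string.

```php
<?php
// Illustrative text containing a bare ampersand and two existing entities.
$text = 'Fish & chips &rarr; &amp; more';

// Default behaviour: every ampersand is escaped, including those that start an entity.
$always = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// With double_encode set to false, existing entities are left as they are.
$once = htmlspecialchars($text, ENT_QUOTES, 'UTF-8', false);

echo $always, "\n"; // Fish &amp; chips &amp;rarr; &amp;amp; more
echo $once, "\n";   // Fish &amp; chips &rarr; &amp; more
```

Used inside the callback, this would probably make the initial htmlspecialchars_decode call unnecessary, though I have not tested that against real CMS content.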

7 comments

Claire Allen November 13, 2009 at 1:49 am

sorry… i’m not really sure which parts I need to change to replace a word in the main content of a site, that might also appear in an alt tag?

Karl November 14, 2009 at 12:07 am

Hi Claire

If you just want to do a simple replacement and don’t mind where the word might appear, you would be better off using PHP’s str_replace() function:

$word = "foo";
$replace = "bar";
$string = '<img src="hello.png" alt="This is a nice foo image" /><p>This is a nice foo string</p>';
echo str_replace($word, $replace, $string); // <img src="hello.png" alt="This is a nice bar image" /><p>This is a nice bar string</p>

Christian August 17, 2010 at 7:52 pm

Hey karl … I thought this was really helpful. Would you be alright if I put a trackback to your blog on this topic? I am just now building a site http://www.redbonzai.com and I want to feature this on it.

Christian

Karl August 19, 2010 at 7:03 pm

Hi Christian,

You’re very welcome to use anything you find here however you like.

Karl

Christian October 6, 2010 at 4:45 pm

Hey Karl …

I am creating a new tutorial that is intended to create a shopping cart using CodeIgniter.
Feel free to peruse it if you wish.
http://www.redbonzai.com/blog/web-development/how-to-create-a-digital-shopping-cart-with-codeigniter/

I am currently doing quite a bit with wordpress also, like creating e-commerce applications inside, and custom CMS backends.

If you would like to collaborate, or exchange ideas you are more than welcome.

Thanks for your time.

Christian

rene January 11, 2012 at 5:44 pm

Hey Karl,

Really nice function! It works well in most cases, but currently the function escapes a bit too much if you have content like the following:

&amp;rarr;

In this case it shouldn’t escape the ampersand but it does. Unfortunately, I don’t know how to add this special case easily to your regular expression.

best,
rené

Karl January 11, 2012 at 8:38 pm

Hi rené,

Hmm it should be able to handle that. I’ll take a look and post an update when I figure out what’s going on. Thanks for letting me know!

Karl
