CakePHP Valid XHTML/XML Behavior

If you have ever written a CMS-type application that accepts input from users and stores it as valid XHTML, you have probably run into a few problems!

This is usually done with a JavaScript WYSIWYG editor on the client side, which keeps things simple for content editors, and the resulting markup is stored on the server. Often, though, content editors work in Microsoft Word and paste their content into the JavaScript editor. That’s where the fun begins! Windows uses its own character set (thanks Microsoft!) known as code page 1252. It is mostly compatible with the much more common Latin-1 (ISO-8859-1) character set, but it is not something you generally want to use on the web, where UTF-8 is a much more sensible way to go. If the content is to be stored in a database, you also need to ensure it matches the character set used by your table.
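To make the code page 1252 problem a bit more concrete, here is a small sketch of what seems to happen to Word's curly quotes when their bytes are treated as Latin-1 and converted to UTF-8. The `latin1ToUtf8()` helper and the two-entry `$map` are hypothetical and written just for this illustration; they are not part of any library.

```php
<?php
// Hypothetical helper: naive Latin-1 -> UTF-8 conversion, one byte at a time
function latin1ToUtf8($bytes) {
    $out = '';
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        $code = ord($bytes[$i]);
        if ($code < 0x80) {
            // plain ASCII is identical in both encodings
            $out .= $bytes[$i];
        } else {
            // U+0080..U+00FF become a two-byte UTF-8 sequence
            $out .= chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F));
        }
    }
    return $out;
}

// "Hello" wrapped in curly double quotes, as raw code page 1252 bytes
$pasted = "\x93Hello\x94";

// Read as Latin-1, 0x93 and 0x94 are control characters, not quotes
$mangled = latin1ToUtf8($pasted);

// Assumed excerpt of a repair map: mangled sequence => intended character
$map = array(
    "\xc2\x93" => "\xe2\x80\x9c", // LEFT DOUBLE QUOTATION MARK
    "\xc2\x94" => "\xe2\x80\x9d", // RIGHT DOUBLE QUOTATION MARK
);
$fixed = str_replace(array_keys($map), array_values($map), $mangled);
```

After this runs, `$mangled` holds the byte pairs `C2 93` and `C2 94` around the word, and `$fixed` holds proper UTF-8 curly quotes.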

Aside from Microsoft-induced headaches, you often have little control over the markup itself. Even the best JavaScript editors don’t get everything 100% correct all the time, and on top of the technological issues there is also potential for human error (typing special characters such as & without encoding them as HTML entities, for example).

All in all, then, you can’t really trust the markup you receive to be valid UTF-8 encoded XHTML. I found myself in this position while developing a CMS with CakePHP, so I decided to write a Model Behaviour that cleans up strings of markup to ensure they are valid and properly encoded. It rids the content of code page 1252 characters by converting them to UTF-8, replaces any unencoded special characters with their properly encoded HTML entities (e.g. & -> &amp;), fixes any invalid XHTML, and tidies the source code nicely.
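The entity handling is easiest to see in isolation. The sketch below uses only PHP's built-in `html_entity_decode()` and `htmlspecialchars()`; the input string is made up. It shows why it helps to decode any existing entities before encoding, assuming the input may contain a mix of encoded and unencoded ampersands.

```php
<?php
// Illustrative input: one ampersand already encoded, one not
$input = 'Fish &amp; Chips & Peas';

// Encoding straight away double-encodes the existing entity
$naive = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

// Decoding first puts every ampersand on an equal footing...
$decoded = html_entity_decode($input, ENT_QUOTES, 'UTF-8');

// ...so a single encoding pass gives consistent output
$safe = htmlspecialchars($decoded, ENT_QUOTES, 'UTF-8');
```

Here `$naive` ends up containing `&amp;amp;`, while `$safe` contains exactly one `&amp;` per ampersand.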

The behaviour’s configuration is pretty simple. You just need to specify which fields should be automatically processed before they are saved:

public $actsAs = array(
    'ValidXhtml' => array(
        'fields' => array('content')
    )
);

You can optionally specify whether to tidy the markup for each field using the PHP Tidy extension (default is true, so you only need to specify this to disable tidy):

public $actsAs = array(
    'ValidXhtml' => array(
        'fields' => array(
            'content' => array('tidy' => false)
        )
    )
);
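Judging by the loop in `beforeSave()`, the two styles can probably be mixed in one `fields` array: a plain value names a field with default options, while a key with an array carries per-field options. The standalone sketch below mirrors that normalisation outside of CakePHP. The field names `body` and `summary` and the `$normalised` array are assumptions for illustration only.

```php
<?php
// Illustrative 'fields' setting mixing both styles
$fields = array(
    'body',                               // plain field name, default options
    'summary' => array('tidy' => false),  // field name with explicit options
);

// Resolve each entry to "field name => should tidy run?"
$normalised = array();
foreach ($fields as $key => $value) {
    if (is_array($value)) {
        $field = $key;
        $options = $value;
    } else {
        $field = $value;
        $options = array();
    }
    // tidy defaults to true when not specified
    $normalised[$field] = isset($options['tidy']) ? (bool) $options['tidy'] : true;
}
```

With this input, `body` would be tidied and `summary` would not.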

Obviously you need to have the tidy extension available to use that feature, but the Behaviour checks for the extension and will automatically configure itself accordingly, so there is no need to explicitly disable tidy if you don’t have it installed.
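The extension check boils down to a call to `extension_loaded('tidy')`. As a rough standalone sketch of the same idea, the hypothetical `tidyIfAvailable()` helper below (not part of the behaviour) repairs markup with `tidy_repair_string()` when the extension is present and otherwise returns the markup unchanged. The two config options are a reduced set chosen for the example.

```php
<?php
// Hypothetical helper, not part of the behaviour itself
function tidyIfAvailable($markup) {
    if (!extension_loaded('tidy')) {
        // no tidy extension: hand the markup back untouched
        return $markup;
    }
    $config = array(
        'output-xhtml'   => true,
        'show-body-only' => true,
    );
    $repaired = tidy_repair_string($markup, $config, 'utf8');
    // fall back to the original markup if tidy reports a failure
    return $repaired === false ? $markup : $repaired;
}

// The <b> element is never closed; tidy should close it when available
$result = tidyIfAvailable('<p>Hello <b>world</p>');
```

Either way the caller gets a string back, so the rest of the code does not need to care whether tidy is installed.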

<?php

class ValidXhtmlBehavior extends ModelBehavior {
   
    private $_defaults = array(
        'fields' => array()
    );
   
    private $_preProcessMap = array(
        // replace empty div tags with a div containing &nbsp;
        '/<div>\s*<\/div>/i' => '<div>&nbsp;</div>'
    );
   
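    // Map of mis-converted Windows-1252 characters to their intended UTF-8 characters.
    // The keys are the UTF-8 encodings of U+0080..U+009F, which is what Windows-1252
    // bytes 0x80..0x9F become when they are converted to UTF-8 as if they were Latin-1.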
    private $_cp1252Map = array(
        "\xc2\x80" => "\xe2\x82\xac", /* EURO SIGN */
        "\xc2\x82" => "\xe2\x80\x9a", /* SINGLE LOW-9 QUOTATION MARK */
        "\xc2\x83" => "\xc6\x92",     /* LATIN SMALL LETTER F WITH HOOK */
        "\xc2\x84" => "\xe2\x80\x9e", /* DOUBLE LOW-9 QUOTATION MARK */
        "\xc2\x85" => "\xe2\x80\xa6", /* HORIZONTAL ELLIPSIS */
        "\xc2\x86" => "\xe2\x80\xa0", /* DAGGER */
        "\xc2\x87" => "\xe2\x80\xa1", /* DOUBLE DAGGER */
        "\xc2\x88" => "\xcb\x86",     /* MODIFIER LETTER CIRCUMFLEX ACCENT */
        "\xc2\x89" => "\xe2\x80\xb0", /* PER MILLE SIGN */
        "\xc2\x8a" => "\xc5\xa0",     /* LATIN CAPITAL LETTER S WITH CARON */
        "\xc2\x8b" => "\xe2\x80\xb9", /* SINGLE LEFT-POINTING ANGLE QUOTATION */
        "\xc2\x8c" => "\xc5\x92",     /* LATIN CAPITAL LIGATURE OE */
        "\xc2\x8e" => "\xc5\xbd",     /* LATIN CAPITAL LETTER Z WITH CARON */
        "\xc2\x91" => "\xe2\x80\x98", /* LEFT SINGLE QUOTATION MARK */
        "\xc2\x92" => "\xe2\x80\x99", /* RIGHT SINGLE QUOTATION MARK */
        "\xc2\x93" => "\xe2\x80\x9c", /* LEFT DOUBLE QUOTATION MARK */
        "\xc2\x94" => "\xe2\x80\x9d", /* RIGHT DOUBLE QUOTATION MARK */
        "\xc2\x95" => "\xe2\x80\xa2", /* BULLET */
        "\xc2\x96" => "\xe2\x80\x93", /* EN DASH */
        "\xc2\x97" => "\xe2\x80\x94", /* EM DASH */
        "\xc2\x98" => "\xcb\x9c",     /* SMALL TILDE */
        "\xc2\x99" => "\xe2\x84\xa2", /* TRADE MARK SIGN */
        "\xc2\x9a" => "\xc5\xa1",     /* LATIN SMALL LETTER S WITH CARON */
        "\xc2\x9b" => "\xe2\x80\xba", /* SINGLE RIGHT-POINTING ANGLE QUOTATION*/
        "\xc2\x9c" => "\xc5\x93",     /* LATIN SMALL LIGATURE OE */
        "\xc2\x9e" => "\xc5\xbe",     /* LATIN SMALL LETTER Z WITH CARON */
        "\xc2\x9f" => "\xc5\xb8"      /* LATIN CAPITAL LETTER Y WITH DIAERESIS*/
    );
   
    // Map of UTF-8 characters to special HTML entities
    private $_entMap = array(
        "\xe2\x80\x98" => '&lsquo;',
        "\xe2\x80\x99" => '&rsquo;',
        "\xe2\x80\x9c" => '&ldquo;',
        "\xe2\x80\x9d" => '&rdquo;',
        "\xe2\x82\xac" => '&euro;',
        "\xe2\x80\xa6" => '&hellip;'
    );
   
    /*
     For reference, these are other entity replacement codes which might be useful one day
    array(
        "\xe2\x80\x9a" => '&sbquo;',    // Single Low-9 Quotation Mark
        "\xe2\x82\xac" => '&euro;',     // Euro sign
        "\xc6\x92"     => '&fnof;',     // Latin Small Letter F With Hook
        "\xe2\x80\x9e" => '&bdquo;',    // Double Low-9 Quotation Mark
        "\xe2\x80\xa6" => '&hellip;',   // Horizontal Ellipsis
        "\xe2\x80\xa0" => '&dagger;',   // Dagger
        "\xe2\x80\xa1" => '&Dagger;',   // Double Dagger
        "\xcb\x86"     => '&circ;',     // Modifier Letter Circumflex Accent
        "\xe2\x80\xb0" => '&permil;',   // Per Mille Sign
        "\xc5\xa0"     => '&Scaron;',   // Latin Capital Letter S With Caron
        "\xe2\x80\xb9" => '&lsaquo;',   // Single Left-Pointing Angle Quotation Mark
        "\xc5\x92"     => '&OElig;',    // Latin Capital Ligature OE
        "\xe2\x80\x98" => '&lsquo;',    // Left Single Quotation Mark
        "\xe2\x80\x99" => '&rsquo;',    // Right Single Quotation Mark
        "\xe2\x80\x9c" => '&ldquo;',    // Left Double Quotation Mark
        "\xe2\x80\x9d" => '&rdquo;',    // Right Double Quotation Mark
        "\xe2\x80\xa2" => '&bull;',     // Bullet
        "\xe2\x80\x93" => '&ndash;',    // En Dash
        "\xe2\x80\x94" => '&mdash;',    // Em Dash
        "\xcb\x9c"     => '&tilde;',    // Small Tilde
        "\xe2\x84\xa2" => '&trade;',    // Trade Mark Sign
        "\xc5\xa1"     => '&scaron;',   // Latin Small Letter S With Caron
        "\xe2\x80\xba" => '&rsaquo;',   // Single Right-Pointing Angle Quotation Mark
        "\xc5\x93"     => '&oelig;',    // Latin Small Ligature OE
        "\xc5\xb8"     => '&Yuml;',     // Latin Capital Letter Y With Diaeresis
    );
    */

   
    public function setup($model, $config = array()) {
        $this->settings[$model->alias] = array_merge($this->_defaults, (array) $config);
    }
   
    public function beforeSave($model) {
        if (!empty($this->settings[$model->alias]['fields'])) {
            foreach ($this->settings[$model->alias]['fields'] as $key => $value) {
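                if (is_array($value)) {
                    $options = $value;
                    $field = $key;
                } else {
                    // reset the options so that settings from a previous
                    // field do not leak into this one
                    $options = array();
                    $field = $value;
                }
                $options['tidy'] = isset($options['tidy']) ? $options['tidy'] : true;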
                if (isset($model->data[$model->alias][$field])) {
                    $model->data[$model->alias][$field] =
                        $this->makeValid($model->data[$model->alias][$field], $options['tidy']);
                }
            }
        }
        return true;
    }
   
    public function makeValid($string, $tidy = true) {
       
        $string = trim($string);
       
        // apply the pre-process map
        $string = preg_replace(array_keys($this->_preProcessMap), $this->_preProcessMap, $string);
       
        // apply the windows > utf8 map
        $string = str_replace(array_keys($this->_cp1252Map), $this->_cp1252Map, $string);
       
        // get rid of any existing html entities to avoid double encoding
        $string = html_entity_decode($string, ENT_QUOTES, 'UTF-8');
       
        // break out any PHP sections since they should not be touched
        $parts = preg_split('/(<\?.+?\?>)/us', $string, -1, PREG_SPLIT_DELIM_CAPTURE);
       
        // replace &, ", ', < and > with their entities, but only where they are not
        // part of an html tag or a comment
        $string = '';
        foreach ($parts as $part) {
            if (false === mb_strpos(trim($part), '<?')) {
                $string .= preg_replace_callback(
                    '/(?<=\>)((?![<](\?|\/)*[a-z][^>]*[>])[^<])+/ius',
                    // closure (PHP 5.3+) in place of create_function(),
                    // which was removed in PHP 8
                    function ($matches) {
                        return htmlspecialchars($matches[0]);
                    },
                    $part
                );
            } else {
                $string .= $part;
            }
        }
       
        // apply the utf-8 > entities map
        $string = str_replace(array_keys($this->_entMap), $this->_entMap, $string);
       
        // trim whitespace from the end of each line and add a nice \n
        // tinymce in particular seems to have a bug where it will insert spaces
        // at the end of lines - this can cause problems with things like Revision
        // Behavior as the values of some fields will never be the same so a revision
        // is always saved even if the data itself has not changed.
        $parts = preg_split("/[\r\n]+/u", $string);
        $parts = array_map('rtrim', $parts);
        $string = implode("\n", $parts);
       
        // tidy the output
        if ($tidy && extension_loaded('tidy')) {
            $tidy_config = array(
                'output-xhtml' => true,
                'show-body-only' => true,
                'indent' => true,
                'indent-spaces' => 4,
                'sort-attributes' => 'alpha',
                'wrap' => 80,
                'preserve-entities' => true,
                'join-styles' => false,
                'logical-emphasis' => true,
                'enclose-text' => true
            );
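            // use a separate variable so the $tidy flag is not overwritten
            $tidyDoc = tidy_parse_string($string, $tidy_config, 'UTF8');
            $tidyDoc->cleanRepair();
            // cast the tidy object back to a string before returning it
            $string = (string) $tidyDoc;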
        }
       
        return $string;
       
    }
   
}

?>
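To see what the middle of `makeValid()` does without loading CakePHP, here is a cut-down sketch of three of its steps run on a made-up string: splitting out PHP sections, encoding special characters in the remaining text, and trimming trailing whitespace from each line. The regular expressions are copied from the behaviour. The input, the variable names and the use of `strpos()` in place of `mb_strpos()` are my own choices for the example.

```php
<?php
// Illustrative input: trailing spaces, mixed line endings, an embedded PHP section
$markup = '<p>Tom & Jerry</p>   ' . "\r\n"
        . '<?php echo $a && $b; ?>' . "\n"
        . '<p>End</p>  ';

// 1. Split on PHP sections, keeping each section as its own part
$parts = preg_split('/(<\?.+?\?>)/us', $markup, -1, PREG_SPLIT_DELIM_CAPTURE);

// 2. Encode text in the non-PHP parts only; PHP sections pass through untouched
$encoded = '';
foreach ($parts as $part) {
    if (strpos($part, '<?') === false) {
        $encoded .= preg_replace_callback(
            '/(?<=\>)((?![<](\?|\/)*[a-z][^>]*[>])[^<])+/ius',
            function ($matches) {
                return htmlspecialchars($matches[0]);
            },
            $part
        );
    } else {
        $encoded .= $part;
    }
}

// 3. Strip trailing whitespace from each line and join the lines with \n
$clean = implode("\n", array_map('rtrim', preg_split("/[\r\n]+/u", $encoded)));
```

The ampersand in the paragraph becomes `&amp;`, while the `&&` inside the PHP section is left alone, and each line ends cleanly with no trailing spaces.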
