HTML to Text: How to extract text content from HTML source code
While working on a project, we recently encountered a small problem.
Within our panel, a user could edit content, which automatically generated HTML source code that was then saved. We had the HTML source of our content stored in our database; however, we needed only the text content in order to generate tags.
To solve this, we resorted to regular expressions (regex) as a quick and easy way to strip out all the markup and retain only the text.
Here is our PHP function:
function plaintext($html)
{
    // Remove comments and any content found inside them up front, rather than relying on strip_tags.
    $plaintext = preg_replace('#<!--.*?-->#s', '', $html);
    // Put a space between list items (strip_tags just removes the tags, so the items would run together).
    $plaintext = preg_replace('#</li>#i', ' ', $plaintext);
    // Remove all script and style elements, including their contents.
    $plaintext = preg_replace('#<(script|style)\b[^>]*>(.*?)</(script|style)>#is', '', $plaintext);
    // Replace br tags with a space (strip_tags would remove them without leaving a separator).
    $plaintext = preg_replace('#<br[^>]*?>#i', ' ', $plaintext);
    // Remove all remaining HTML.
    $plaintext = strip_tags($plaintext);
    return $plaintext;
}
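
To see why the function adds spaces before calling strip_tags, here is a minimal sketch comparing the two approaches on a small list. The sample markup and the variable names ($html, $joined, $separated) are made up for illustration and are not part of the original function.

```php
<?php
// Hypothetical sample input: two list items with no whitespace between them.
$html = '<ul><li>red</li><li>green</li></ul>';

// strip_tags alone removes the tags, so the two items run together.
$joined = strip_tags($html);

// Replacing each closing </li> with a space first keeps the items apart.
$separated = trim(strip_tags(preg_replace('#</li>#i', ' ', $html)));

echo $joined, "\n";    // redgreen
echo $separated, "\n"; // red green
```

The same reasoning applies to br tags: any tag that visually separates words needs to be turned into whitespace before the remaining markup is stripped, otherwise neighbouring words merge and the generated tags come out wrong.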