
HTML to Text: How to extract text content from HTML source code

While working on a project, we recently encountered a small problem.

Within our panel, a user could edit content, which automatically generated HTML source code that was then saved to our database. To generate tags, however, we needed to read only the text content of that HTML.

To solve this, we resorted to regular expressions (regex) as a quick and easy way to strip out all markup and code and retain only the text.
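To illustrate why some preprocessing helps before the tags are removed, here is a minimal sketch of what PHP's built-in strip_tags() does on its own. The sample strings are made up for illustration and are not taken from our project.

```php
<?php
// strip_tags() removes the markup but leaves no separator behind,
// and it keeps the text found inside <script> and <style> elements.
$list   = '<ul><li>one</li><li>two</li></ul>';
$breaks = 'first line<br>second line';
$script = '<script>alert(1)</script>Hello';

echo strip_tags($list), "\n";   // onetwo
echo strip_tags($breaks), "\n"; // first linesecond line
echo strip_tags($script), "\n"; // alert(1)Hello
```

Words that run together like this would produce misleading tags, so each of these cases is handled with a regular expression first.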

Here is our PHP function:

function plaintext($html)
{
    // remove comments and any content found inside them
    $plaintext = preg_replace('#<!--.*?-->#s', '', $html);

    // put a space between list items (strip_tags just removes the tags)
    $plaintext = preg_replace('#</li>#', ' ', $plaintext);

    // remove all script and style elements, including their contents
    $plaintext = preg_replace('#<(script|style)\b[^>]*>(.*?)</\1>#is', "", $plaintext);

    // replace br tags with a space (strip_tags removes them without leaving a separator)
    $plaintext = preg_replace("#<br[^>]*?>#", " ", $plaintext);

    // remove all remaining html
    $plaintext = strip_tags($plaintext);

    return $plaintext;
}
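A short usage sketch may help show what the function returns. The block below carries a compact copy of plaintext() so it runs on its own. Its patterns for comments, `</li>`, the closing script/style tag and `<br` are our reading of what the comments describe, so treat them as assumptions. The tidy_text() helper and the sample HTML are hypothetical additions for illustration: they decode entities and collapse whitespace, which is often convenient before splitting text into tags.

```php
<?php
// Compact copy of plaintext() so this sketch is self-contained.
// The regex patterns are assumed from the comments in the article.
function plaintext($html)
{
    $text = preg_replace('#<!--.*?-->#s', '', $html);                       // comments
    $text = preg_replace('#</li>#', ' ', $text);                            // list items
    $text = preg_replace('#<(script|style)\b[^>]*>(.*?)</\1>#is', '', $text); // script/style
    $text = preg_replace('#<br[^>]*?>#', ' ', $text);                       // line breaks
    return strip_tags($text);
}

// Hypothetical helper (not part of the original function): decode HTML
// entities and collapse runs of whitespace into single spaces.
function tidy_text($html)
{
    $text = html_entity_decode(plaintext($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return trim(preg_replace('/\s+/u', ' ', $text));
}

// Made-up sample content, one element per line as an editor might emit it.
$html = implode("\n", [
    '<!-- draft -->',
    '<style>p { color: red; }</style>',
    '<p>Fish &amp; chips<br/>served daily</p>',
    '<ul><li>salt</li><li>vinegar</li></ul>',
]);

echo tidy_text($html), "\n"; // Fish & chips served daily salt vinegar
```

Note that the function only adds a separator for list items and br tags. Other adjacent block elements with no whitespace between them, such as `</p><ul>`, will still have their text run together, so the sample keeps each element on its own line.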

via i22
i22.in is publishing this article under the Attribution-NoDerivs 3.0 License by Creative Commons.
