Remove html tags using ereg_replace

@solomonMar 30.2005

Hi All

I've got a reasonable amount of know-how with PHP now and in my time I've seen these sceeeery scripts resembling something the Enigma encrypter would churn out (for example, $text = ereg_replace("[[:alpha:]]+://[^<>[:space:]]+[[:alnum:]/]", "<a href=\"\\0\">\\0</a>", $text); – omgwtf 😮 runawaaayyy!!!)

Anyway – I find myself in the situation where I’m reading a webpage source code into a string and I need to strip the html tags out of it and leave just the displayable content. I realise I need to use ereg_replace or preg_replace (do I?) but I’m jiggered if I know how to use these expressions and the manuals aren’t much help!

For example if I read this into a string…

[code=html]
<html>
<head>
<title>Something wicked this way comes</title>
</head>
<body>
This is the interesting stuff I want to extract
</body>
</html>
[/code]

I want to end up with just this:

[code=html]Something wicked this way comes This is the interesting stuff I want to extract[/code]

I'd massively appreciate being given the line of code that performs this little piece of magic and if you can explain exactly what it's doing then I might even pop round and give you a hug.

Thanks loads already

Sol


PHP

12 Comments(s) _↴

@scragarMar 30.2005 — #

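[code=php]
$txt = "<html>
 <head>
 <title>Something wicked this way comes</title>
 </head>
 <body>
 This is the interesting stuff I want to extract
 </body>
 </html>";
// ereg_replace() was removed in PHP 7, and POSIX (ereg) patterns take no "/" delimiters,
// so this uses preg_replace(), where the delimiters belong.
$text = preg_replace("/<[^<>]*>/", "", $txt);
[/code]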
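
For anyone wondering how a pattern like that reads, here is a rough sketch with the pieces commented. It assumes preg_replace() (the PCRE function) rather than ereg_replace(), since the ereg_* family was removed in PHP 7. The names $html and $plain are made up for this example.

[code=php]
<?php
// Illustrative input; $html and $plain are hypothetical names for this sketch.
$html = '<p class="intro">Hello <b>world</b></p>';

// Pattern pieces:
//   /       opening delimiter (required by the preg_* functions)
//   <       a literal "<"
//   [^<>]*  zero or more characters that are neither "<" nor ">"
//   >       a literal ">"
//   /       closing delimiter
$plain = preg_replace('/<[^<>]*>/', '', $html);

echo $plain, "\n"; // Hello world
[/code]

Bear in mind this only removes the tags themselves. Anything between a script or style tag pair is left behind as text, and a ">" inside an attribute value will end the match early.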

@solomonauthorMar 30.2005 — #hmm... doesn't quite seem to work fully. It looks like it would probably work on the simplified example of html I gave but not on 'real world' html. Perhaps I wasn't specific enough :o

Basically, if anything appears between a '<' and a '>' it must be removed (along with the '<' and '>')... whether it's a character, a number, a space, a slash, equals, quotes, javascript.... anything!

Any other ideas anyone?

Thanks for a damn good stab tho, scragar.

@scragarMar 30.2005 — #that works.

http://scragar.mybesthost.com/2.php

@solomonauthorMar 30.2005 — #This is me eating my words:

:eek:[SIZE=1]my words[/SIZE]

It looks like I owe you a bit of an apology, scragar. Sorry. It seems to work just fine. I obviously had difficulty squeezing it into my code - I was getting results that simply weren't working and it would appear that it was my fault. Thanks for going to the effort of uploading that bit of script for me - you da man.

Anyway - grovelling aside - could you be wonderful enough to explain how that special line of code is working? Cheers.

@JonaMar 30.2005 — #

[font=trebuchet ms]Hate to make things easier for you guys, but...[/font]

[code=php]
 $htmlStr = striptags($htmlStr);
 [/code]

@MarkLMar 30.2005 — #[QUOTE]

[font=trebuchet ms]Hate to make things easier for you guys, but...[/font]

[code=php]
 $htmlStr = striptags($htmlStr);
 [/code]

[/QUOTE]

Darn those predefined functions, they take all the fun out of scripting!

BTW - I have always used this function as strip_tags(), not sure if both syntaxes are valid.

@JonaMar 30.2005 — #[QUOTE]Darn those predefined functions, they take all the fun out of scripting!

BTW - I have always used this function as strip_tags(), not sure if both syntaxes are valid.[/QUOTE]
[font=trebuchet ms]Why, so it [i]is[/i] [/font][font=courier new]strip_tags()[/font][font=trebuchet ms]! Darn those predefined functions, you never know when they have an underscore or where that underscore may be![/font]
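
For reference, a minimal sketch of strip_tags() run on something like the sample page from the first post. The optional second argument (a list of tags to keep) is part of the real function. The sample markup and the names $all and $keepBold are only for illustration.

[code=php]
<?php
// Sample markup, loosely based on the first post (the <b> tag is added for the demo).
$htmlStr = "<html><head><title>Something wicked this way comes</title></head>"
         . "<body>This is the <b>interesting</b> stuff</body></html>";

// Remove every tag.
$all = strip_tags($htmlStr);

// Remove every tag except <b>.
$keepBold = strip_tags($htmlStr, '<b>');

echo $all, "\n";      // Something wicked this way comesThis is the interesting stuff
echo $keepBold, "\n"; // Something wicked this way comesThis is the <b>interesting</b> stuff
[/code]

Note that strip_tags() does not add any spacing where a tag used to be, so text from neighbouring elements can run together unless the source had whitespace between them.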

@Stephen_PhilbinMar 30.2005 — #I find that [url=http://www.zend.com/phpfunc/]this little scamp[/url] is quite a help.

@solomonauthorMar 30.2005 — #[QUOTE]

[font=trebuchet ms]Hate to make things easier for you guys, but...[/font]

[code=php]
 $htmlStr = striptags($htmlStr);
 [/code]

[/QUOTE]
*THWACK* (sound of palm rapidly meeting forehead)

You also just inspired me here - I was trying to figure out how to get ereg_replace to strip out all html entities with no joy. After hunting through php.net I discover html_entity_decode() !!
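
A small sketch of the two functions working together, assuming a made-up snippet of markup in $page. Stripping tags before decoding seems the safer order, because an escaped "&lt;" in the text would otherwise turn into a real "<" and could be mistaken for the start of a tag.

[code=php]
<?php
// $page is a hypothetical sample; entities here are &amp;, &lt; and &pound;.
$page = '<p>Fish &amp; chips cost &lt; &pound;5</p>';

// Strip tags first, then decode the entities that are left in the text.
$text = html_entity_decode(strip_tags($page), ENT_QUOTES, 'UTF-8');

echo $text, "\n"; // Fish & chips cost < £5
[/code]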

So far, my little portion of script looks like this:

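[code=php]
<?php
// $url holds the host and path without the scheme, e.g. "example.com/page.html"
$words = strip_tags(file_get_contents('http://'.$url)); // fetch the page, drop the tags
$words = html_entity_decode($words);                    // turn &amp; etc. back into characters
$words = preg_replace("/[[:punct:]]/", "", $words);     // remove punctuation
$words = preg_replace("/[[:space:]]/", " ", $words);    // turn each whitespace character into a space
?>
[/code]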

:rolleyes: You should have seen the first incarnation of this script - not quite so neat.

I'd be interested if anyone can polish it up a little more tho!

AND... if someone could at least point me in the direction of instruction on how to write the business end of ereg_replace I will be eternally grateful!

@JonaMar 30.2005 — #[font=trebuchet ms]Your code has [i]preg_replace[/i], but your RegEx syntax doesn't look Perl-compatible. I think you meant [i]ereg_replace[/i]. Am I missing something?[/font]

@Stephen_PhilbinMar 31.2005 — #See that's the problem with regexp. I swear they make a new type of it to celebrate the birth of every bunny in this world.

@solomonauthorMar 31.2005 — #[QUOTE][font=trebuchet ms]Your code has [i]preg_replace[/i], but your RegEx syntax doesn't look Perl-compatible. I think you meant [i]ereg_replace[/i]. Am I missing something?[/font][/QUOTE]

You aren't missing anything! Firstly, as I keep saying, I don't understand the first thing about that syntax - secondly, I just copied that bit from somebody else's script, warts and all! :o So I shall change it to ereg straight away.

Hmm.. just tried changing it and it stops doing quite what I need - it doesn't remove all punctuation and numbers. n/m - I shall just have to deal with having slightly clumsy code. Wouldn't be the first time!
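
One possible reason the preg_replace version was fine as it stood: PCRE (which is what preg_replace() uses) also accepts the POSIX bracket classes such as [[:punct:]] and [[:space:]], as long as the pattern is wrapped in delimiters. A small sketch with made-up sample text:

[code=php]
<?php
// Hypothetical sample text for the demo.
$words = "Hello, world!\tIt's 2005...\n";

// [[:punct:]] matches ASCII punctuation; remove it.
$words = preg_replace('/[[:punct:]]/', '', $words);

// Collapse each run of whitespace to a single space, then trim the ends.
$words = trim(preg_replace('/[[:space:]]+/', ' ', $words));

echo $words, "\n"; // Hello world Its 2005
[/code]

Digits are not punctuation, so "2005" survives here. A class like '/[[:punct:][:digit:]]/' should take the numbers out as well.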

I have just finished working on my little script - would you like to see it in operation?

http://www.thrutch.co.uk/code/passwords/

constructive criticism always welcome
