Help with regular expressions

@gsc Oct 29.2009

Hi,

I'm trying to extract information from the HTML source of a website. I have the source stored in a variable, $listings, and want to split it into an array of matches, $matches, using the preg_match_all function like so:

preg_match_all('/<li>(.*)</li>/', $listings, $matches);

I want to match blocks of HTML found between list tags.

The above code doesn't seem to match anything when I print out the array. I have tested the regex in a regex testing tool, so I'm inclined to think I'm using the function wrong.

I’d appreciate any ideas,

Thanks

PHP

8 Comments

@blue-eye-labs Oct 29.2009 - If you're trying to read through an HTML document, why not use something like DOMDocument?
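
For anyone wondering what that might look like, here is a minimal sketch, assuming the goal is the inner HTML of every <li>. The sample markup is hypothetical and only stands in for $listings; the $items array name is also just for illustration.

```php
<?php
// Hypothetical sample markup standing in for the thread's $listings variable.
$listings = '<ul><li>First <b>item</b></li><li>Second item</li></ul>'
          . '<ol><li>Third item</li></ol>';

$doc = new DOMDocument();
// loadHTML() parses an HTML string into a DOM tree.
$doc->loadHTML($listings);

$items = [];
foreach ($doc->getElementsByTagName('li') as $li) {
    // Build the inner HTML of this <li> by serialising each child node.
    $inner = '';
    foreach ($li->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    $items[] = $inner;
}

print_r($items);
```

Because the parser tracks which closing tag belongs to which opening tag, each <li> comes back as its own entry no matter how many lists are on the page.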

@gsc (author) Oct 29.2009 - I'm hoping to mine large amounts of information. Do you know whether this will work as quickly as regex?

@blue-eye-labs Oct 30.2009 - [QUOTE]I'm hoping to mine large amounts of information. Do you know whether this will work as quickly as regex?[/QUOTE]

I'm afraid I am not really sure but I gather regex can be a fairly intensive operation. I found an interesting discussion here:

http://stackoverflow.com/questions/1538584/regular-expression-vs-xml-functions-in-php

and here's another interesting one:

http://www.talkphp.com/tips-tricks/4476-zfs-zend_dom-domdocument-wrapper.html

I hope those help slightly.

@Mindzai Oct 30.2009 - DOMDocument is a far better option than using regex. Your current regex, for example, is greedy: it would match from an opening <li> to the last closing </li> it can reach. If a page has more than one <ol> or <ul>, you are going to match from the start of the first item to the end of the last, and everything in between.
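
To illustrate the greedy-matching point, here is a small sketch on hypothetical sample markup. It swaps the pattern delimiter to '#', which is an assumption about what would suit this pattern: with '/' as the delimiter, the unescaped '/' in '</li>' ends the pattern early and PHP raises an "Unknown modifier" warning, which may well be why the original call matched nothing.

```php
<?php
// Hypothetical sample markup standing in for $listings.
$listings = '<ul><li>One</li><li>Two</li></ul><ol><li>Three</li></ol>';

// '#' is used as the delimiter so the '/' in '</li>' needs no escaping.
// Greedy: (.*) runs on to the last </li> it can reach.
preg_match_all('#<li>(.*)</li>#', $listings, $greedy);

// Lazy: (.*?) stops at the nearest </li>.
// The 's' modifier lets '.' match newlines, for items that span lines.
preg_match_all('#<li>(.*?)</li>#s', $listings, $lazy);

print_r($greedy[1]); // one long match covering all three items
print_r($lazy[1]);   // three separate matches: One, Two, Three
```

The lazy version still breaks on nested lists, since a regex cannot pair opening and closing tags, so it is only a stopgap compared with a DOM parser.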

@gsc (author) Nov 26.2009 - Hi guys,

I took your advice and wrote some screen-scraping code with DOMDocument, which worked brilliantly at first. I recently went back to my code and I'm now getting errors from the loadHTML function such as:

ID week-1 already defined in Entity, line: 126.
htmlParseEntityRef: no name in Entity, line: 303.

After looking at the source of the page I'm trying to scrape, it seems they have changed the markup, and it is now too badly formed for loadHTML to parse (I think).

I checked it with the W3C validator and it showed 330 errors in the page against the DOCTYPE the page declares. I'm unsure whether this is related, but it seems to be the case.

Does anyone have any ideas?

Much appreciated

@Mindzai Nov 26.2009 - Ask the site owner for an API. If they don't provide anything, they probably don't want people scraping their pages.

@chris22 Nov 26.2009 - This is a great tool for parsing HTML documents:

http://sourceforge.net/projects/simplehtmldom

@blue-eye-labs Nov 27.2009 - Both the XML and DOM parsers in PHP are fairly strict and don't like badly formed markup, so I think the latest suggestion, made by chris22, would be the way to go.
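
Before switching libraries, it may be worth checking whether loadHTML really failed: it generally reports problems like the two quoted above as warnings and still builds a document. A minimal sketch, assuming that is the case for the page in question. The markup below is hypothetical and only imitates the duplicate ID and the bare ampersand.

```php
<?php
// Hypothetical malformed markup: a duplicate id and a bare ampersand,
// similar to the two errors quoted earlier in the thread.
$html = '<ul><li id="week-1">Fish & chips</li><li id="week-1">Second</li></ul>';

// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadHTML($html);

// Show what the parser complained about, then clear the error buffer.
foreach (libxml_get_errors() as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The document is still usable despite the reported problems.
$texts = [];
foreach ($doc->getElementsByTagName('li') as $li) {
    $texts[] = $li->textContent;
}
print_r($texts);
```

If the resulting tree turns out to be too mangled to use, a more forgiving parser such as the one chris22 linked is the next thing to try.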
