Help with regular expressions

Hi,

I'm trying to extract information from the HTML source of a website. I have the source stored in a variable, $listings, and want to collect the matches in an array, $matches, using the preg_match_all function like so:

preg_match_all('/<li>(.*)</li>/', $listings, $matches);

I want to match blocks of HTML found between list tags.

The above code doesn't seem to match anything when I print out the array. I have tested the regex in a regex testing tool, so I'm inclined to think I'm using the function wrong.

I’d appreciate any ideas,

Thanks

PHP

8 Comments

@blue-eye-labs, Oct 29, 2009: If you're trying to read through an HTML document, why not use something like DOMDocument?
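
For reference, a minimal sketch of the DOMDocument approach suggested here. The sample markup in $listings is hypothetical and stands in for the page source from the question, and the sketch assumes PHP's DOM extension is enabled. It collects the inner HTML of every list item into $matches, which is roughly what the regex was meant to do.

```php
<?php
// Hypothetical sample input standing in for the scraped page source.
$listings = '<ul><li>First <b>item</b></li><li>Second item</li></ul>';

$doc = new DOMDocument();
$doc->loadHTML($listings);

$matches = [];
foreach ($doc->getElementsByTagName('li') as $li) {
    // Rebuild the inner HTML of each <li> by serializing its child nodes.
    $inner = '';
    foreach ($li->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    $matches[] = $inner;
}

print_r($matches);
```

Because the parser builds a tree, nested tags and multiple lists on the same page are handled without any special casing in the pattern.
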
@gscauthor, Oct 29, 2009: I'm hoping to mine large amounts of information. Do you know whether this will work as quickly as regex?
@blue-eye-labs, Oct 30, 2009: Quoting @gscauthor: "I'm hoping to mine large amounts of information. Do you know whether this will work as quickly as regex?"

I'm afraid I am not really sure but I gather regex can be a fairly intensive operation. I found an interesting discussion here:

http://stackoverflow.com/questions/1538584/regular-expression-vs-xml-functions-in-php

and here's another interesting one:

http://www.talkphp.com/tips-tricks/4476-zfs-zend_dom-domdocument-wrapper.html

I hope those help slightly.
@Mindzai, Oct 30, 2009: DOMDocument is a far better option than regex. Your current regex, for example, would match from an opening <li> to a closing </li>, and because .* is greedy it runs to the last closing tag it can find. If a page has more than one <ol> or <ul>, you are going to match from the start of the first to the end of the last, and everything in between.
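
The greedy behaviour described above can be sketched as follows. The sample string is hypothetical. The sketch also switches the pattern delimiter from "/" to "#": in the original call the "/" inside "</li>" ends the pattern early, so PHP reports an unknown modifier and matches nothing, which is one likely reason the array came back empty.

```php
<?php
// Hypothetical sample input; in the question this would be the page source.
$listings = '<ul><li>one</li><li>two</li></ul><ol><li>three</li></ol>';

// "#" is used as the delimiter so the "/" in "</li>" needs no escaping.
// The "s" modifier lets "." match newlines as well.

// Greedy: ".*" runs from the first <li> to the LAST </li>.
preg_match_all('#<li>(.*)</li>#s', $listings, $greedy);

// Lazy: ".*?" stops at the FIRST </li> after each <li>.
preg_match_all('#<li>(.*?)</li>#s', $listings, $lazy);

print_r($greedy[1]); // one element: "one</li><li>two</li></ul><ol><li>three"
print_r($lazy[1]);   // three elements: "one", "two", "three"
```

The lazy version still breaks on nested lists or attributes such as <li class="...">, which is part of why a DOM parser is the safer choice.
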
@gscauthor, Nov 26, 2009: Hi guys,

I took your advice and created some code for screen scraping with DOMDocument, which worked brilliantly at first. I recently went back to my code and I'm getting errors from the loadHTML function such as:

ID week-1 already defined in Entity, line: 126.
htmlParseEntityRef: no name in Entity, line: 303.

After looking at the source of the page I'm trying to scrape, it seems they have changed the markup, and it is now too badly formed for loadHTML to parse (I think).

I checked it with the W3C validator, which reported 330 errors against the DOCTYPE the page declares. I'm unsure whether this is related, but it seems to be the case.

Does anyone have any ideas?

Much appreciated
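
Messages like the two quoted above are parser warnings, and loadHTML will often still build a usable tree after emitting them. A minimal sketch, assuming that is the case for the page in question: libxml_use_internal_errors(true) collects the messages instead of printing them, and the document can then be queried as before. The sample markup is hypothetical and only mimics the two problems reported (a duplicate id and a bare "&").

```php
<?php
// Hypothetical sample mimicking the reported problems:
// a duplicate id ("week-1") and an unescaped "&".
$html = '<ul><li id="week-1">A & B</li><li id="week-1">C</li></ul>';

// Collect libxml messages internally instead of raising PHP warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Inspect the collected messages if needed, then clear the buffer.
foreach (libxml_get_errors() as $error) {
    // $error->message and $error->line describe each problem.
    error_log(trim($error->message) . ' (line ' . $error->line . ')');
}
libxml_clear_errors();

// The tree is still available for querying.
$items = [];
foreach ($doc->getElementsByTagName('li') as $li) {
    $items[] = trim($li->textContent);
}

print_r($items);
```

Whether the recovered tree is good enough depends on how badly the real page is broken, so it is worth checking the extracted data against the page by hand.
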
@Mindzai, Nov 26, 2009: Ask the site owner for an API. If they don't provide anything, they probably don't want people scraping their pages.
@chris22, Nov 26, 2009: This is a great tool for parsing HTML documents:

http://sourceforge.net/projects/simplehtmldom
@blue-eye-labs, Nov 27, 2009: Both the XML and DOM parsers in PHP are fairly strict and don't like badly formed markup, so I think chris22's suggestion would be the way to go.