How to Access the DOM of Any URL?

I’m trying to extract links for a web crawler for a search engine. The browser automatically makes relative links absolute, and I can use this capability with document.anchors to get an array of links. So how can I replace “document” with the URL of a web page I’m not on?

If I can’t, should I make an iframe that goes to the location and returns document.anchors? The code that asks for the links only runs when someone clicks on a link to a search result, but visiting maybe a hundred locations in the iframe could make the page redirect before the code finishes running.

What should I do?


7 Comments

@ray326 (May 06, 2008): You'd create a document object (with a different name obviously) and load it with an XMLHttpRequest call.

@zenoxman (author, May 06, 2008): So if I do

xbj = new ActiveXObject("Microsoft.XMLHTTP");
xbj.open("post", [url I want to access the DOM of], false);
xbj.send(null);
doc = xbj.responseText;

can I refer to the DOM of any URL, for example doc.anchors?

@ray326 (May 07, 2008): Well, that's one thing I'd try at least. You first have to get the document locally to do anything with it. You could also try loading it into an iframe, à la Netscape's Inner Browsing.

@zenoxman (author, May 07, 2008): I've tried that, and doc isn't yet a DOM object. I'm investigating...

@zenoxman (author, May 08, 2008): I've searched the internet for two days, and I think the best (or only) solution would be to use an invisible iframe. Would that be too slow a script to execute before the page redirects? I ask because I also need to update data in a SQL database (by using document.[id of invisible form].submit to execute the PHP action) when a result link is clicked.

How can I make the iframe parse the responseText without displaying it? That's the slow part. And wouldn't the PHP/SQL operate in the background even after the redirect?

@ray326 (May 08, 2008): All the iframe would do is hold the response. I suspect the textual content of the iframe would have to be parsed into a document tree -- not a trivial thing unless you've found a parser to throw at it. Have you considered just searching through the text for anchors?
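
As a rough illustration of searching the text for anchors, the sketch below scans an HTML string for the href values of a tags with a regular expression. The function name extractHrefs is made up for this example, and the pattern is an assumption about reasonably well-formed markup: it is not an HTML parser and will miss or misread unusual cases (hrefs inside comments or scripts, attribute values that contain ">", and so on).

```javascript
// Hypothetical helper: collect the raw href values of <a> tags in an HTML string.
// This is a regex scan, not a real HTML parser.
function extractHrefs(html) {
  // Matches href="...", href='...', or an unquoted href value inside an <a ...> tag.
  const pattern = /<a\b[^>]*?\shref\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/gi;
  const hrefs = [];
  for (const match of html.matchAll(pattern)) {
    // Exactly one of the three capture groups is defined for each match.
    hrefs.push(match[1] ?? match[2] ?? match[3]);
  }
  return hrefs;
}
```

The values come back exactly as written in the markup, so relative links are still relative at this point.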

@zenoxman (author, May 08, 2008): That's the first thing I did, but I need to be able to convert relative URLs to absolute.
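
One possible way to do that conversion without a DOM, in current browsers and Node.js, is the standard URL constructor, which resolves a relative reference against a base URL. The sketch below is only an illustration: toAbsoluteUrls, the example.com page address, and the sample links are all made up here, and it assumes the page has no base element overriding its own address.

```javascript
// Hypothetical helper: resolve each href against the URL of the page it was found on.
// Hrefs that cannot be parsed as URLs are skipped.
function toAbsoluteUrls(hrefs, pageUrl) {
  const absolute = [];
  for (const href of hrefs) {
    try {
      absolute.push(new URL(href, pageUrl).href);
    } catch (err) {
      // new URL throws a TypeError on input it cannot parse; ignore that href.
    }
  }
  return absolute;
}

// Made-up page address and links, for illustration only.
const links = toAbsoluteUrls(
  ["about.html", "/contact", "../up.html", "http://other.example/x", "#top"],
  "http://example.com/dir/page.html"
);
```

Here links holds the five addresses resolved the way a browser would resolve them on that page, for example "about.html" becomes "http://example.com/dir/about.html" and "/contact" becomes "http://example.com/contact".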