
Extracting data from html/xml/js website.

Hi guys.

I hope you can help me with this task.

This website is very simple if you just look at it. It is used for finding specific files, just like Windows Explorer etc.
On the left side there is a normal folder structure where everything is indexed. So if I go to a folder like “Furniture”, then click “Chairs” and then “Legs for chair”, I get x number of documents for chair legs in that folder. This particular page is in XML, so I have the file name, version, description etc. in a table, where I can click the file I want and it is downloaded to the hard drive (probably through the JavaScript). The document is already on the hard drive, in a “files” folder belonging to the website, but the file names are just numbers, like 4325436453, so I don’t know what is what, and there are 5000 documents.

What I want is to extract all these files into their respective folders on a hard drive, instead of having to open the webpage every time. That way I can copy them to portable devices etc., which matters because this webpage is only compatible with IE 11 with MSXML, Java etc., so I cannot just copy the website to my Android or Windows Phone and make it work.

A secondary solution would be to make some kind of bot or script that opens all the XML files, finds the description/title for each file name and renames the files, so that I have both the document number and the description in the file name. That way I would be able to search for them.
The document number, version, file name, title etc. are all in the same XML file, and if there are 3 documents available for download on a specific page, there will also be 3 document references in that XML file.

I thought about getting a script to pair the document number and the title, so that I can do some Excel magic and put it all into a .bat file that mass-renames all the files in the folder.
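The pairing and renaming idea can be sketched in Python without the Excel and .bat detour. This is only a rough sketch under assumptions: the element names `document`, `filename` and `title` are hypothetical (the real XML will use its own names and may keep the values in attributes instead), and the script only builds a rename plan, it does not touch any file.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical element names; adjust them to whatever the real XML uses.
DOC_TAG = "document"
FILE_TAG = "filename"
TITLE_TAG = "title"


def safe_name(text):
    """Replace characters that Windows does not allow in file names."""
    return re.sub(r'[<>:"/\\|?*]+', "_", text).strip()


def collect_titles(xml_dir):
    """Map each stored file name to its title, across all XML files."""
    titles = {}
    for xml_path in Path(xml_dir).rglob("*.xml"):
        try:
            root = ET.parse(xml_path).getroot()
        except ET.ParseError:
            continue  # skip files that are not well-formed XML
        for doc in root.iter(DOC_TAG):
            name = doc.findtext(FILE_TAG)
            title = doc.findtext(TITLE_TAG)
            if name and title:
                titles[name.strip()] = title.strip()
    return titles


def plan_renames(files_dir, titles):
    """Return (old_path, new_path) pairs without changing anything on disk."""
    plan = []
    for old in Path(files_dir).iterdir():
        title = titles.get(old.name)
        if old.is_file() and title:
            new_name = f"{old.stem} - {safe_name(title)}{old.suffix}"
            plan.append((old, old.with_name(new_name)))
    return plan
```

Printing the plan first and only then calling `old.rename(new)` for each pair keeps the step reversible; working on a copy of the “files” folder is safer still.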

Any suggestions?

Thank you very much.


7 Comments(s)

@jedaisoul Jun 21.2016 — Hi and welcome to the site. What you are asking for cannot be done in HTML, and does not really have anything to do with building web sites. If I understand you correctly, it could be done by using FTP access for the original download, then using PHP on a local server to categorize and access the files.
@imnotabot (author) Jun 22.2016 — Let me elaborate.

It is a local website on my computer. There is no FTP access or PHP or anything like that. It is just a local HTML file that gets me the document I need.

No internet access.

It does not have anything to do with building websites.
@jedaisoul Jun 22.2016 — Thanks for clarifying. As I said, what you are asking for cannot be done in HTML. That is why the site needs Microsoft-specific extensions, which, as you have said, do not work on Android-based machines.

If you want to code a retrieval system that finds documents by searching their content, then you could use a general-purpose programming language, like C++. On the other hand, if you want a web-based solution (for cross-platform use), then I'd recommend setting up a local host and coding the host app in PHP. I've had a look on the web and found [b][url=https://play.google.com/store/apps/details?id=ru.kslabs.ksweb&hl=en]KSWEB server[/url][/b], which runs on Android hardware and comes with PHP. I would stress that I have never used it; I'm just suggesting possible routes for achieving what you want.
@imnotabot (author) Jun 22.2016 — Thank you.

This is also not what I want.

I want to extract all the files that the website refers to into their respective folders on the hard drive, so that I don't have to use a webpage to find my files but can just browse folders like normal.
@jedaisoul Jun 22.2016 — You will still need to use a language like C++ or PHP if you want to automate the process of sorting the files into directories.
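A scripting language such as Python would also do for the sorting step. Here is a minimal sketch, assuming the mapping from stored file name to folder path has already been worked out from the site's XML index. The `placement` dictionary and the function name are made up for illustration. Files are copied rather than moved, so the original “files” folder stays intact.

```python
import shutil
from pathlib import Path


def copy_into_folders(files_dir, out_dir, placement):
    """Copy each known file into out_dir/<folder parts>/ and return the new paths.

    placement maps a stored file name to a list of folder names, e.g.
    {"645634567.doc": ["Furniture", "Chairs", "Legs for chair"]}.
    """
    copied = []
    for name, folders in placement.items():
        source = Path(files_dir) / name
        if not source.is_file():
            continue  # listed in the index but missing on disk
        target_dir = Path(out_dir).joinpath(*folders)
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / name
        shutil.copy2(source, target)  # copy2 also keeps the timestamps
        copied.append(target)
    return copied
```

The resulting tree is plain folders and files, so it can be browsed on any device without the original webpage.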
@Sempervivum Jun 22.2016 — [QUOTE]I want to extract all the files that the website is referring to into respective folders on the hard drive.[/QUOTE] Check whether WinHTTrack can do that for you.
@imnotabot (author) Jun 22.2016 — WinHTTrack cannot do this, because it needs to be able to read the website, and only an IE 11 / MSXML-compatible browser can read this website.

jedaisoul: okay, that sounds more cumbersome than I had hoped.

One of the links to a directory on the website could be mfcddocslist.html?view=1&level=5&level_1_id=163145&level_2_id=62870&level_3_id=153824&level_4_id=220181&treeid=70533850 which will load some list from an XML file. The link to a document could be javascript:showDoc("645634567.doc")
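Both kinds of link can be taken apart with Python's standard library. This is a small sketch; reading the `level_N_id` values as the folder path from the top of the tree downwards is a guess based on the example link, not something the site confirms, and the function names are made up.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Matches showDoc("...") or showDoc('...') and captures the file name.
SHOW_DOC = re.compile(r"""showDoc\(\s*["']([^"']+)["']\s*\)""")


def folder_ids(link):
    """Return the level_N_id values of a directory link, ordered by N."""
    query = parse_qs(urlsplit(link).query)
    levels = {}
    for key, values in query.items():
        match = re.fullmatch(r"level_(\d+)_id", key)
        if match:
            levels[int(match.group(1))] = values[0]
    return [levels[n] for n in sorted(levels)]


def doc_name(link):
    """Return the file name inside a javascript:showDoc(...) link, or None."""
    match = SHOW_DOC.search(link)
    return match.group(1) if match else None


directory_link = (
    "mfcddocslist.html?view=1&level=5&level_1_id=163145&level_2_id=62870"
    "&level_3_id=153824&level_4_id=220181&treeid=70533850"
)
print(folder_ids(directory_link))  # ['163145', '62870', '153824', '220181']
print(doc_name('javascript:showDoc("645634567.doc")'))  # 645634567.doc
```

If the ids can be matched to folder names in the XML, the same ids give the folder path for every document on that page.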