Extracting data from html/xml/js website.

@imnotabotJun 21.2016

Hi guys.

I hope you can help me with this task.

This website is very simple if you just look at it. It is used for finding specific files, just like Windows Explorer etc.
On the left side there is a normal folder structure where everything is indexed. So if I go to some folder, like “Furniture”, then click “Chairs” and then “Legs for chair”, I get x number of documents for chair legs in that folder. This particular page is in XML, so I have the file name, version, description etc. in a table, where I can click on the file I want and it is downloaded to the hard drive (probably through the JavaScript). The document is already on the hard drive in a “files” folder for the website, but the file names are just numbers, like 4325436453, so I don’t know what is what, and there are 5000 documents.

What I want is to extract all these files into their respective folders on a hard drive, instead of having to access the webpage all the time. This way I can load them onto portable devices etc., which is good, because this webpage is only compatible with IE 11 with MSXML, Java etc., so I cannot just download the website to my Android or Windows phone and make it work.

A secondary solution would be to make some kind of bot or script that opens all the XML files, finds the description/title for each file name and renames the files, so that I have both the document number and the description in the file name. This way I would be able to search for it.
The document number, version, file name, title etc. are in the same XML file, and if there are 3 documents available for download on a specific page, there will also be 3 document references in that XML file.
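
The rename idea above could be sketched in Python roughly as follows. The thread never shows the XML itself, so the element names `document`, `filename`, `number` and `title` are hypothetical and would need to be replaced after looking at one of the real files. The sketch only builds a list of (old name, new name) pairs and does not touch any files.

```python
import re
import xml.etree.ElementTree as ET

def safe_name(text):
    """Replace characters Windows does not allow in file names."""
    return re.sub(r'[\\/:*?"<>|]+', "_", text).strip()

def rename_plan(xml_text):
    """Return (old_name, new_name) pairs for every document in one XML list.

    Assumed (hypothetical) layout: <document> elements with <filename>,
    <number> and <title> children.
    """
    root = ET.fromstring(xml_text)
    plan = []
    for doc in root.iter("document"):
        old = doc.findtext("filename", "").strip()
        number = doc.findtext("number", "").strip()
        title = safe_name(doc.findtext("title", ""))
        if not old:
            continue  # nothing to rename without a file name
        ext = old[old.rfind("."):] if "." in old else ""
        plan.append((old, f"{number} - {title}{ext}"))
    return plan

# Made-up sample in the assumed layout, for illustration only.
SAMPLE = """<documents>
  <document><filename>4325436453.doc</filename><number>4325436453</number>
    <title>Chair leg: oak/steel</title></document>
</documents>"""
```

Calling `rename_plan(SAMPLE)` gives one pair whose new name contains both the document number and a cleaned-up title. Reviewing the pairs before renaming anything is the safer order, since the original site stops working once the numbered names are gone.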

I thought about getting a script to pair the document number and the title, so that I can do some Excel magic and put it all into a .bat file that mass-renames all the files in the folder.
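
As a rough illustration of that pairing step, the Python sketch below turns a list of (file name, title) pairs into CSV text that Excel can open, and into the body of a .bat file with one `ren` command per document. The function names and the "number - title" naming pattern are assumptions made for the example. The pairs themselves would have to come from the XML files.

```python
import csv
import io

def pairs_to_csv(pairs):
    """Serialise (file_name, title) pairs as CSV text that Excel can open."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["file_name", "title"])
    writer.writerows(pairs)
    return buf.getvalue()

def pairs_to_bat(pairs):
    """Build the body of a .bat file with one `ren` command per document."""
    lines = ["@echo off"]
    for old, title in pairs:
        stem, dot, ext = old.rpartition(".")
        if not dot:  # no extension in the original name
            stem, ext = old, ""
        new = f"{stem} - {title}{dot}{ext}"
        lines.append(f'ren "{old}" "{new}"')
    return "\r\n".join(lines) + "\r\n"
```

Titles containing characters that are illegal in Windows file names, or a `%` sign (which batch files treat specially), would need cleaning before the generated file is run.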

Any suggestions?

Thank you very much.


HTML

7 Comments

@jedaisoulJun 21.2016 — #Hi and welcome to the site. What you are asking for cannot be done in HTML, and does not really have anything to do with building web sites. If I understand you correctly, what you want is doable using FTP access for the original download, then PHP on a local server to categorize and access the files.

@imnotabotauthorJun 22.2016 — #[QUOTE]Hi and welcome to the site. what you are asking for cannot be done in HTML, and is not really anything to do with building web sites. If I understand you right, what you are asking to do is doable using FTP access for the original download, then using PHP on a local server to categorize and access the files.[/QUOTE]

Let me elaborate.

It is a local website on my computer. There is no FTP access or PHP or anything like that. It is just a local html file that gets me the document that I need.

No internet access.

It does not have anything to do with building websites.

@jedaisoulJun 22.2016 — #Thanks for clarifying. As I said, what you are asking for cannot be done in HTML. That is why you needed to use Microsoft-specific extensions, which, as you have said, do not work on Android-based machines.

If you want to code a retrieval system that finds documents by searching their content, then you could use a general-purpose language like C++. On the other hand, if you want a web-based solution (for cross-platform use), then I'd recommend setting up a local host and coding the host app in PHP. I've had a look on the web and found [b][url=https://play.google.com/store/apps/details?id=ru.kslabs.ksweb&hl=en]KSWEB server[/url][/b], which runs on Android hardware and comes with PHP. I would stress that I have never used it; I'm just suggesting possible routes for achieving what you want.

@imnotabotauthorJun 22.2016 — #[QUOTE]Thanks for clarifying. As I said, what you are asking for cannot be done in HTML. That is why you needed to use Microsoft specific extensions, which, as you have said, do not work on Android based machines.

If you want to code a retrieval system, that will find documents by searching their content then you could use a general purpose computer language, like C++. On the other hand, if you want a web based solution (for cross platform use), then I'd recommend setting up a local host and code the host app in PHP. I've had a look on the web and I've found [b][url=https://play.google.com/store/apps/details?id=ru.kslabs.ksweb&hl=en]KSWEB server[/url][/b] that runs on Android hardware, and comes with PHP. I would stress that I have never used it, I'm just suggesting possible routes for achieving what you want.[/QUOTE]

Thank you.

This is also not what I want.

I want to extract all the files that the website refers to into their respective folders on the hard drive, so that I don't have to use a webpage to find my files but can just browse folders like normal.

@jedaisoulJun 22.2016 — #You will still need to use a language like C++ or PHP if you want to automate the process of sorting the files into directories.
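
A minimal sketch of that sorting step, written in Python rather than C++ or PHP purely for brevity. It assumes the folder path and readable name of each document have already been worked out from the site's XML lists. The `entries` structure and the function name `export_tree` are hypothetical.

```python
import shutil
from pathlib import Path

def export_tree(entries, files_dir, out_dir):
    """Copy each stored file into out_dir/<folder path>/<readable name>.

    entries: iterable of (folder_parts, stored_name, new_name) tuples,
    a hypothetical structure that would be built from the site's XML lists.
    Returns the list of paths written.
    """
    written = []
    for folder_parts, stored_name, new_name in entries:
        source = Path(files_dir) / stored_name
        if not source.is_file():
            continue  # listed in the index but missing on disk
        target_dir = Path(out_dir).joinpath(*folder_parts)
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / new_name
        shutil.copy2(source, target)  # copy, so the original site keeps working
        written.append(target)
    return written
```

An entry such as `(["Furniture", "Chairs", "Legs for chair"], "645634567.doc", "645634567 - Chair legs.doc")` would end up under `out_dir/Furniture/Chairs/Legs for chair/`. Copying instead of moving leaves the "files" folder intact in case something goes wrong.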

@SempervivumJun 22.2016 — #[QUOTE]I want to extract all the files that the website is referring to into respective folders on the hard drive.[/QUOTE]Check whether WinHTTrack can do that for you.

@imnotabotauthorJun 22.2016 — #WinHTTrack cannot do this, because it needs to be able to read the website, and only IE 11 with MSXML compatibility, blah blah, can read this website.

jedaisoul: okay, it sounds more cumbersome than I had hoped.

One of the links to a directory on the website could be mfcddocslist.html?view=1&level=5&level_1_id=163145&level_2_id=62870&level_3_id=153824&level_4_id=220181&treeid=70533850 which will load some list from an XML file. The link to a document could be javascript:showDoc("645634567.doc")
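
Both link shapes quoted above can be picked apart with the Python standard library, without Internet Explorer. The sketch below pulls the `level_N_id` values out of a directory link and the file name out of a `javascript:showDoc("...")` link. How those ids map to folder names inside the XML is not shown in the thread, so that part is left out. The function names are made up for the example.

```python
import re
from urllib.parse import urlsplit, parse_qs

def folder_ids(link):
    """Return the level_N_id values of a directory link, ordered by N."""
    query = parse_qs(urlsplit(link).query)
    levels = []
    for key, values in query.items():
        m = re.fullmatch(r"level_(\d+)_id", key)
        if m:
            levels.append((int(m.group(1)), values[0]))
    return [value for _, value in sorted(levels)]

def doc_file(link):
    """Return the file name inside javascript:showDoc("..."), or None."""
    m = re.search(r'showDoc\(\s*["\']([^"\']+)["\']\s*\)', link)
    return m.group(1) if m else None
```

For the directory link quoted above, `folder_ids` returns the four ids from `level_1_id` to `level_4_id` in order, and `doc_file('javascript:showDoc("645634567.doc")')` returns `645634567.doc`.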
