Batch href attribute string manipulation software

What I want to do is remove the bold substring from millions of links in hundreds of thousands of documents:

content.aspx@aID=79446[B]&searchStr=[COLOR=SlateGray]some parameters[/COLOR][/B]#79446

Does anyone know of software that can remove, append, and otherwise manipulate the contents of HTML tags?


12 Comments

@Charles, Jun 29, 2006:
[QUOTE]Does anyone know of software that can remove, append, and otherwise manipulate the contents of HTML tags?[/QUOTE]
I know of a few; which one I'd use would depend on the particulars.

If this is HTML, then I'd use Perl and the HTML::TreeBuilder module to grab and parse each file.

If this is XHTML then I'd use XSLT and Xalan or Saxon to parse the documents. Depending on the file structure I'd use either a simple batch file or, again, Perl to coordinate the thing.
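
As a rough illustration of the parser route, here is a minimal sketch in Python rather than Perl (a swapped-in language, not what the post above proposes). It only audits a document: it uses the standard library's `html.parser` to list the `href` values of `<a>` tags that carry a `searchStr` parameter, so you can see what would be affected before changing anything. The names `HrefCollector` and `find_search_links` are made up for this example.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values of <a> tags that contain a searchStr parameter."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.hits = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and "searchStr=" in value:
                self.hits.append(value)


def find_search_links(html):
    """Return every <a href> value in the document that contains searchStr=."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hits
```

Because the parser only looks at real `<a>` start tags, text that merely mentions `searchStr=` elsewhere in the page is ignored.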

@Kevey, Jun 29, 2006:
There may be a better or easier way, but I use MultiUpdate to make global changes across about 1300 pages, and it works very well. I'm not sure how it would perform on hundreds of thousands of documents, though. http://www.ksware.com/mu.html

@Vassil_Catsarov (author), Jun 29, 2006:
Thank you for your prompt answer, Charles! Here is the structure of one of the documents:


<HTML>

<HEAD>

<title id="pageTitle">AccessMedicine - ***** cinguli</title>

<link rel="stylesheet" href="../global.css" type="text/css">

<script language="javascript" src="../Global.js"></script>

</HEAD>

<body style="MARGIN:0px" bgcolor="#E2E6EB">


---------------------------

---------------------------

<tr>
<td colspan="4"><br class="Spacer5"><span id="lblResult"><table cellspacing="0" cellpadding="0" border="0" width="100%" style="border-top:1px solid #999999;"><tr><td colspan="2" height="5"><img src="../images/spacer.gif"></td></tr><tr><td width="32" align="left" valign="top"><img align="absbottom" src="../images/coverSearch_ropp.gif"></td><td valign="top"><a class="font11noMargin" href="../[COLOR=Red]content.aspx@aID=971383&searchStr=*****+cinguli#971383[/COLOR]"><b>The role of the <i>cingulate *****</i> in the behavior of animals and humans has been the subject o...</b></a><br/><font class="font10DarkGray"><b>Adams and Victor's Neurology</b> &gt; Chapter 25. The Limbic Lobes and the Neurology of Emotion &gt; The Limbic Lobes and the Neurology of Emotion: Introduction &gt; Physiology of the Limbic System</font></td/></tr><tr><td colspan="2" height="5" style="border-bottom:1px solid #999999;"><img src="../images/spacer.gif"></td></tr></table></span><br class="Spacer5"></td>
</tr>
<tr>
<td width="70%" class="font11noMargin"><span id="lblCurrentRecords2"><font color="#666666"><b>1-1</b> of <b>1 Results</b></font></span></td>
<td width="15%" align="right">

</td>
<td width="5%" align="center">

</td>
<td width="10%" align="left">

</td>
</tr>
</table>
</td>
</tr>

</table>

</td>
</tr>
</table>
</td>
</tr>
</table>
</td>
</tr>
</form>
</TBODY>
</table>
<table border="0" cellpadding="0" cellspacing="0" width="772">
<tr>
<td width="9">&nbsp;</td>
<td>

<table border="0" width="100%" cellpadding="0" cellspacing="0" bgcolor="#E2E6EB">

<tr>

<td valign="top">

<br class="Spacer5">

<font class="font10">Copyright ©2006 The McGraw-Hill Companies.&nbsp;&nbsp;All rights reserved.<br>

<a href="../public/privacy.aspx">Privacy Notice</a>.&nbsp;Any use is subject to the <a href="../public/termsofuse.aspx">Terms of Use</a> and <a href="../public/notice.aspx">Notice</a>.&nbsp;&nbsp;<a href="../public/additionalcredits.aspx">Additional Credits and Copyright Information</a>.</font><br><br class="Spacer8">

<table border="0" width="100%" cellpadding="0" cellspacing="0">

<tr>

<td width="26%" align="left"><a href="http://www.mheducation.com/" target="_blank"><img src="../images/logo_mcgraw_hill.gif" alt="McGraw-Hill Education" border="0" align="top"></a></td>

<td width="41%" align="left"><a href="http://www.silverchair.com/" target="_blank"><img src="../images/logo_ssc.gif" border="0" alt="A Silverchair Information System"></a></td>

<td width="33%" align="right"><a href="http://www.mcgrawhill.com/" target="_blank"><img src="../images/logo_mcgraw-hill_gradient.jpg" alt="The McGraw-Hill Companies" border="0" vspace="5"></a></td>

</tr>

</table>

<br class="Spacer10">

</td>

</tr>

</table>

<a href="../manifest/resource_manifest.htm"></a>

</td>

</tr>

</table>

</body>

</HTML>

@Charles, Jun 29, 2006:
Yes, Perl's the way to go. And while you are at it, you could use the parser to clean up that atrocious HTML.

I'll warn you though, the learning curve is a little steep. The good folks over at the Perl forum should be able to get you started.

You could also do this with Perl and regular expressions. It's a good bit easier to do but it can get a little bit ugly and might have some unwanted side effects.
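
One way to limit the side effects of the regular-expression route is to touch only the text inside `href` attributes. The sketch below shows that idea in Python rather than Perl (a swapped-in language). It assumes the part to remove always starts at `&searchStr=` (or `&amp;searchStr=`) and runs up to the `#` fragment, as in the sample link; the function name `strip_search_str` is made up for this example.

```python
import re

# href="..." or href='...': group 1 is the attribute prefix, group 2 the quote
# character, group 3 the URL itself.
HREF = re.compile(r'(href\s*=\s*)(["\'])(.*?)\2', re.IGNORECASE | re.DOTALL)

# The part to drop: "&searchStr=" (or "&amp;searchStr=") up to, but not
# including, the "#" fragment. Assumption: this layout holds for every link.
SEARCH_STR = re.compile(r'&(?:amp;)?searchStr=[^#]*')


def strip_search_str(html):
    """Remove the searchStr parameter from href attribute values only."""
    def fix(match):
        prefix, quote, url = match.group(1), match.group(2), match.group(3)
        return prefix + quote + SEARCH_STR.sub('', url) + quote
    return HREF.sub(fix, html)
```

Text outside `href` attributes is left alone, so a stray `&searchStr=` in the page body would survive.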

@Vassil_Catsarov (author), Jun 29, 2006:
Thank you, Kevey!

I used Dreamweaver's find-and-replace option to modify constant strings. It says that MultiUpdate changes constant strings only. Can I set it up to delete anything between the & and # characters, for example, not just constant strings?

@Vassil_Catsarov (author), Jun 29, 2006:
Charles, you mean I have to learn a new programming language to accomplish this? Or is there a more user-friendly way?

@Charles, Jun 29, 2006:
[QUOTE]Charles, you mean I have to learn a new programming language to accomplish this? Or is there a more user-friendly way?[/QUOTE]
No, I mean you get to learn a new language - a very happy situation.

@Vassil_Catsarov (author), Jun 29, 2006:
[QUOTE]No, I mean you get to learn a new language - a very happy situation.[/QUOTE]

Well, it depends on the teachers too. Are there any tutorials for the thing I want to do, to get me started before I go directly to the forum?

@Charles, Jun 29, 2006:
If you went with the regular expression method, then this would be what we call a "one-liner": simple enough that the code is passed on the command line. For one file it would be something like

perl -pi~ -e 's/&(?:amp;)?searchStr=[^#"]*//g' file_name

The ~ after -i keeps a backup copy of each file. If your files are all in one directory, then you could use a simple batch file to run that for each file. Otherwise the Perl gets a little more complicated.
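
For files spread over nested directories, the same substitution can be driven by a short script. This is a minimal sketch in Python rather than Perl (a swapped-in language), under stated assumptions: the pattern removes everything from `&searchStr=` up to the `#` fragment, every regular file under the root is a candidate, and a `~` backup is written next to each changed file. The names `clean_file` and `clean_tree` are made up for this example.

```python
import re
import shutil
from pathlib import Path

# Bytes pattern, so file encodings and line endings are left untouched.
# Assumption: the unwanted part runs from "&searchStr=" up to "#" or a quote.
PATTERN = re.compile(rb'&(?:amp;)?searchStr=[^#"]*')


def clean_file(path):
    """Rewrite one file in place, keeping a backup; return True if it changed."""
    original = path.read_bytes()
    cleaned = PATTERN.sub(b"", original)
    if cleaned == original:
        return False
    # Keep the untouched original as "name~" before overwriting.
    shutil.copy2(path, path.with_name(path.name + "~"))
    path.write_bytes(cleaned)
    return True


def clean_tree(root, glob="*"):
    """Apply clean_file to every matching file under root; return the count changed."""
    candidates = [
        p for p in sorted(Path(root).rglob(glob))
        if p.is_file() and not p.name.endswith("~")  # skip earlier backups
    ]
    return sum(clean_file(p) for p in candidates)
```

Trying `clean_tree` on a small copy of the documents first, and comparing a few changed files against their `~` backups, is a cheap way to check the pattern before running it on everything.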

@Vassil_Catsarov (author), Jun 29, 2006:
[QUOTE]If you went with the regular expression method, then this would be what we call a "one-liner": simple enough that the code is passed on the command line.[/QUOTE]

Thank you for your help, Charles! Before I can understand how this simple one-liner works, I have to read a few thousand lines, as I'm completely new to anything that has to do with Perl. Any recommendations on what a newbie should read to make use of the above example?

@Charles, Jun 29, 2006:
Go to the Perl forum, describe your problem, and describe where the files are and how they are named. Someone helpful will be along soon to get you going - chances are they'll give you the script you need and a few pointers on where to go from there.

@Vassil_Catsarov (author), Jun 29, 2006:
[QUOTE]Go to the Perl forum, describe your problem, and describe where the files are and how they are named. Someone helpful will be along soon to get you going - chances are they'll give you the script you need and a few pointers on where to go from there.[/QUOTE]

Thank you for your cooperation - I'll follow your advice!