
I’ve got a regexp problem and I suck too much to be able to figure out how to fix it.

Hi folks. A while ago I was trying to put together a text processing class and was lucky enough to be graced by the assistance of the all-knowing NogDog. It came in the form of a regexp that (at the time) fulfilled my requirements perfectly. My requirements for some filters, however, have expanded a little, and I’ve developed the regexp a bit based on what NogDog started me with. After about a day of driving myself mad trying to isolate the problem, I’ve finally found why it keeps going wrong. The only problem is, I don’t know how to fix it.

Here’s the expression I’m having problems with:

[code]
"/\b".$linkstock[0]."\b(?!(([^<]*>)|([^([{)]*(}]))|([^([{link( )?to=\"([^\"\r\n]*)\"( )?is=\"([^\"\r\n]*)\"}])]*([{\/link}]))))/i"
[/code]

And here’s a slightly more readable breakdown of it:

[code]
"/
\b".

    $linkstock[0].

    "\b

    (?!
        (
            ([^<]*>)

            |

            ([^([{)]*(}]))

            |

            (
                [^
                    (
                        [{
                        link

                        ( )?

                        to=\"

                        ([^\"\r\n]*)

                        \"

                        ( )?

                        is=\"

                        ([^\"\r\n]*)

                        \"
                        }]
                    )
                ]

                *

                ([{\/link}])
            )
        )
    )
/i"
[/code]

The problem comes in the latter section. It is caused by these inverted sets:

[code]
([^\"\r\n]*)
[/code]

On their own, they’re fine, but the problem is, if you look a little earlier on than those sets, you’ll see that they themselves are inside an inverted set. That is where my problem lies. PHP5 says:

[quote]

Warning: preg_replace() [function.preg-replace]: Compilation failed: unmatched parentheses

[/quote]

I would imagine most folks that are reasonably familiar with regexp will be able to understand what I intend to express; I just don’t know how to express it without nesting those inverted sets.

@Jeff_Mott Jul 29.2005 — [quote][font=courier]([^([{)]*(}]))[/font][/quote]This bit seems to have something wrong with it (9th line of your broken-down version). Let's start by stripping off the outermost pair of parentheses:

[^([{)]*(}])

It becomes pretty obvious now that something is definitely not matched up correctly. Though I have no idea what you're trying to match, so I can't really tell you how to fix it.
@Stephen_Philbin (author) Jul 29.2005 — The section you picked out there is for matching the strings [{ and }].

Originally, the "simplified" version was for matching whatever the variable $linkstock[0] contained anywhere in the subject string except for sections inside < and > (basically HTML tags). Now I've extended it to match the contents of the variable anywhere in the subject except for sections delimited by < and >, or [{ and }], or [{link to="anything that isn't a " or a new line here" is="anything that isn't a " or a new line here"}] and [{/link}].

That bit you picked out is fine though. It's the section after that which is causing the problem: the nested inverted sets that are intended to match [{link to="anything that isn't a " or a new line here" is="anything that isn't a " or a new line here"}] and [{/link}]. I'm guessing I need another way to express "anything that isn't a " or a new line" in the attributes of to="" and is="" without using inverted sets.
@NogDog Jul 29.2005 — OK, my brain is too foggy today to figure out that whole thing, but I'm pretty sure at least part of the problem is attempting to use "[ ]" to do more than they can. The only thing that should be between a pair of square brackets (a "character class definition") is a list of characters, optionally preceded by a negation symbol ("^"). All the characters within (and there are shortcuts for certain groups of characters or ranges of characters) are used to establish a match for precisely one character in the searched string which falls within the set of characters defined by that character class.

Therefore, there is no nesting of character classes, nor groupings with "( )", etc., as all you are able to match is one character within the class being defined.

So, what is the solution? That, my friend, probably needs more thought than my brain is capable of performing right now.
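A minimal sketch of the point about character classes, assuming a current PHP build with the bundled PCRE library. The patterns and variable names are illustrative only and do not come from the thread. It shows that a "[" inside a class is just another listed character, that the class ends at the first "]", and that a group written inside a class can leave a stray ")" behind.

```php
<?php
// Intended as "a class nested in a class", but PCRE reads "[^a[^b]" as ONE
// class (any character except a, [, ^ or b), followed by a literal "c]".
$nested = '/^[^a[^b]c]$/';

$matchesLiteralTail = preg_match($nested, 'xc]'); // 1: "x", then the literal "c]"
$rejectsSingleChar  = preg_match($nested, 'x');   // 0: the literal "c]" is missing

// Groups cannot live inside a class either. Here the class ends at the first
// "]", so the final ")" has no partner and the pattern fails to compile.
// The "@" only silences the warning for this demonstration.
$broken = '/([^(x[^"]*)]*)/';
$compileResult = @preg_match($broken, 'abc');     // false: compilation failed
```

The last case appears to be the same kind of failure as the "unmatched parentheses" warning quoted in the opening post, although the exact offset and wording depend on the PCRE version.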
@Stephen_Philbin (author) Jul 30.2005 — Well, the grouping of characters inside the set is fine. I know that because ([^([{)]*(}])) still matches what I want it to perfectly fine (the [{ and }] delimiters). I know for certain it is those nested sets, because if I just remove them so that it looks for [{link to="" is=""}] and [{/link}], then there is no problem at all. It's just when I come to say that I want anything that is not a new line or a quotation mark by using those inverted sets [^\"\r\n] that the error arises.

Isn't there some other form of not operator or anything? Like expressing the three characters in a group separated with the or operator ("|\r|\n) and then inverting it with a not operator?

Like

!("|\r|\n)

or

(!("|\r|\n))

or

(!"|\r|\n)

Or something like that?
@Jeff_Mott Jul 30.2005 — I think I see what you're doing now. You're only really checking for the closing ">" or "}]" or "[{link}]".

Firstly, I don't think you want a look behind assertion, I think you want a look ahead. The look behind will always be trying to match against the last character of $linkstock[0].

Secondly, having a set within a set is not your problem because that can't happen in PHP. Some regexp implementations allow that (JavaScript, for instance), but PHP does not. Most likely all you've done by trying is to add the characters "[" and "^" to the larger character class.

Thirdly, NogDog is right: you're trying to make character classes do more than they can. They cannot do negations on full patterns, for instance, only on an unordered list of characters.

Also a problem is that I don't think regular expressions provide a way of saying "match any string but this one", the way you are trying to say "match any string that is not '[{link to='...". Traditionally, the way someone would do this is by manipulating the logic that is returned by the regular expression. For example (pseudocode):

! /bar/

This would be true if the string "bar" does not occur. But if you wanted to make sure, for instance, that someone never says "that stupid bar" but anything else, i.e., "that [i]adjective[/i] bar", is OK, then you've got a problem because there's no way to express "not the string 'stupid'".

Though there is an ugly kind of workaround. It does use character classes, but since character classes only operate on single characters, we have to make one class for every character. For example:

/that ((?:[^s]|\S[^t]|\S{2}[^u]|\S{3}[^p]|\S{4}[^i]|\S{5}[^d])\S*) bar/

This is a very long and very ugly way of making sure that at least one of the characters you would need to write the word "stupid" is not present.

Obviously this is not desirable. So again, what you would have to do is manipulate the logic returned by the regexp. If you cannot easily match the good cases, then match the bad case and invert it:

! /that stupid bar/
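For comparison, here is a hedged sketch of both options in PHP. The first inverts the result of the "bad case" match, as described above. The second uses a negative lookahead, (?!...), which PCRE supports and which is the same construct already used in the pattern under discussion; in simple cases it can stand in for the long per-character alternation. The sample sentences and variable names are made up for illustration.

```php
<?php
// Option 1: match the bad case and invert the result.
$isAcceptable = !preg_match('/that stupid bar/', 'that lovely bar'); // true

// Option 2: a negative lookahead rejects exactly the word "stupid"
// at the position where the adjective would start.
$pattern = '/that (?!stupid\b)(\S+) bar/';

$okay   = preg_match($pattern, 'that lovely bar', $m); // 1, $m[1] is "lovely"
$banned = preg_match($pattern, 'that stupid bar');     // 0
$longer = preg_match($pattern, 'that stupidly bar');   // 1, only the exact word is rejected
```

The trailing \b inside the lookahead is a design choice: without it, any adjective that merely begins with "stupid" would be rejected as well.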
@Jeff_Mott Jul 30.2005 — [quote]Well the grouping of characters inside the set is fine, I know that because

([^([{)]*(}]))

still matches what I want it to perfectly fine (the [{ and }] delimiters)[/quote]
What this matches is, first, zero or more of [i]any[/i] single character that is not a "(", "[", "{" or ")". Then the string "}]" at the end of that. I don't think that's quite what you wanted.
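A small sketch of that reading, assuming the sub-pattern sits inside a negative lookahead after the search word, as in the opening post. The word "e3" and the sample strings are illustrative assumptions. The lookahead only checks that a "}]" follows with no "(", "[", "{" or ")" in between; it never checks that a "[{" came before the word.

```php
<?php
$pattern = '/\be3\b(?![^([{)]*(}]))/';

$samples = ['e3', '[{e3}]', '[e3}]', '{[e3]}'];
$results = [];
foreach ($samples as $sample) {
    // 1 means the word is matched (and would be converted), 0 means it is skipped.
    $results[$sample] = preg_match($pattern, $sample);
}
// $results: 'e3' => 1, '[{e3}]' => 0, '[e3}]' => 0, '{[e3]}' => 1
```

The third sample is skipped even though it has no opening "[{", which is the kind of case the sub-pattern cannot tell apart from a real delimiter pair.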
@Stephen_Philbin (author) Jul 30.2005 — No, those earlier ones are matching exactly what I want. You can see it in action (rather crudely) at http://www.dootdootdoodydoodydootdoodoooo.com/test/obtest2.php

Paste in the following:

[quote]
e3

[{e3}]

[e3]

{e3}

{[e3]}
[/quote]

The results you see are exactly what I want, except when you put "e3" or "E3" between [{link to="whatever" is="whatever"}] and [{/link}]: the filters still convert the text listed in my database when it sits between those two items. I do not want the filters to apply to text inside them, so I want a regexp that lets me say "apply my filters to whatever matches this expression" and have the expression match anything outside of those item pairs. I think I might actually be getting close now. I'll just have to keep plugging away.
@Jeff_Mott Jul 30.2005 — Do these cases all make sense?

[e3}]
{e3}]
(e3}]

(e3]}
{e3]}
@Stephen_Philbin (author) Jul 30.2005 — Ah. Those catch it out.

[e3}]

{e3}]

(e3}]

I would rather those "e3"s were matched for conversion.

I tried out something that didn't work not long ago, which would have given you the wrong results if you had tried it at the time, but it's back to how it was now.
@Stephen_Philbin (author) Aug 01.2005 — OK. I've rewritten it now and it's a bit better. I thought I actually had it, but it's still a bit off. Here's what I've got so far:

[code]
$find = "/\bcat\b(?!(([^<])*(>))|(([^[][^{])*(}]))|(([^[][^{][^l][^i][^n][^k]([}]])*[^}][^]])*([{\/link}])))/i";
[/code]


The more readable version being:
[code]
$find = "/
    \b
    cat
    \b
    (?!
        (
            ([^<])*
            (>)
        )
        |
        (
            ([^[][^{])*
            (}])
        )
        |
        (
            (
                [^[]
                [^{]
                [^l]
                [^i]
                [^n]
                [^k]
                (
                    [}]]
                )*
                [^}]
                [^]]
            )*
            ([{\/link}])
        )
    )
/i";
[/code]


You can test it at http://www.dootdootdoodydoodydootdoodoooo.com/test/regexphell.php

It looks for "cat" and replaces "cat" with "boat".

I've tested it by feeding it the following text:


[quote]
<cat>cat a cat a dog moo a woof roar cat magic</cat>

[{cat}]cat a cat a dog moo a woof roar cat magic[{cat}]

[{link}]a cat a dog moo a woof roar cat magic[{/link}]

[{link to="cat a cat a dog moo a woof roar cat magic"}]cat a cat a dog moo a woof roar cat magic[{/link}]

[{link to="cat a cat a dog moo a woof roar cat magic" is="cat a cat a dog moo a woof roar cat magic"}]cat a cat a dog moo a woof roar cat magic[{/link}]
[/quote]


There seems to be a strange pattern of alternation in the output in the [{link}] type lines via the attributes. Works, doesn't work, works, doesn't and so on. Can you see why it doesn't work?
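One alternative that sidesteps the single big lookahead is sketched below. It is not the approach used in this thread: it swaps in a different technique, splitting the text into protected and unprotected pieces with preg_split() and replacing only in the unprotected ones. The function name replaceOutsideProtected() and the exact definition of a "protected" region are assumptions made for illustration.

```php
<?php
// Hypothetical helper, not code from the thread.
function replaceOutsideProtected(string $text, string $word, string $replacement): string
{
    // Protected regions, tried in this order:
    //   1. [{link ...}] through the next [{/link}]
    //   2. any other [{ ... }] block
    //   3. anything between < and >
    $protected = '/(\[\{link\b[^\r\n]*?\}\].*?\[\{\/link\}\]|\[\{.*?\}\]|<[^<>]*>)/is';

    // With PREG_SPLIT_DELIM_CAPTURE the pieces alternate:
    // even indexes hold free text, odd indexes hold protected regions.
    $parts = preg_split($protected, $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    // preg_quote() keeps any special characters in $word literal.
    $find = '/\b' . preg_quote($word, '/') . '\b/i';
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            $parts[$i] = preg_replace($find, $replacement, $part);
        }
    }
    return implode('', $parts);
}

echo replaceOutsideProtected('<cat>cat a cat</cat>', 'cat', 'boat'), "\n";
```

Because each protected region is consumed as a whole before any replacement happens, the word never has to "look ahead" to decide whether it is inside a region. The trade-off is that an unclosed "[{" would protect everything up to the next "}]".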
@griff777 Aug 01.2005 — The error message looks like the " marks are unbalanced. Try a \"

(thought I read that somewhere)
@Stephen_Philbin (author) Aug 01.2005 — No, the escape is only there to have it appear as a literal '"' in the expression, but I'm not using that pattern any more anyway. I'm using the most recently posted one.