
Getting delimiter from a line

Hi

Can someone please tell me the best way to find the delimiter in a given line (not counting spaces)?

For convenience, we can use the email address as the reference point if necessary: the very next character after the email (other than a space) can be taken as the delimiter. But there may be cases where the email address comes last and no delimiter follows it.

Some examples are:

Ex 1:

[CODE]
jon, doe, [email protected], 996655
[/CODE]

Ex 2:

[CODE]
[email protected]; doe; ;996655
[/CODE]

Ex 3:

[CODE]
jon# doe# 996655# [email protected]
[/CODE]

Ex 4:

[CODE]
jon doe 96655
[/CODE]

Ex 5:

[CODE]
jon doe 996655 [email protected]
[/CODE]

Ex 6:

[CODE]
jon;doe;[email protected];996655;
[/CODE]

In Ex 4 and Ex 5 above, it should report that no delimiter was found.

Any help is appreciated.

Thanks

@NogDog, Jun 20, 2013: So what would the expected response be for:

foo.bar#example.com#"I like using ""#"" or ""."" as a CSV delimiter."

?
@phantom007 (author), Jun 20, 2013: Hi, thanks for your reply.

When I saved that line of yours and opened it in my spreadsheet program, specifying # as the delimiter, it split the line up very nicely. Check the screenshot:

http://i.imgur.com/K8uEBlC.png

How are they doing it?
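
A likely explanation, assuming the spreadsheet follows the usual CSV quoting rules: once the delimiter is given, anything inside double quotes is treated as literal text, and a doubled quote ("") stands for a single quote character. PHP's built-in str_getcsv() applies the same rules, so a rough sketch of what the spreadsheet is probably doing (variable names are only illustrative) looks like this:

```php
<?php
// Sample line from the post above. The '#' delimiter is supplied by hand here,
// exactly as it was in the spreadsheet import dialog; it is not detected.
$line = 'foo.bar#example.com#"I like using ""#"" or ""."" as a CSV delimiter."';

// Arguments: input, delimiter, enclosure (quote) character, escape character.
$fields = str_getcsv($line, '#', '"', '\\');

print_r($fields);
```

This yields three fields, and the # and . inside the quoted third field are not treated as separators. Note that it only works because the delimiter was already known.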
@NogDog, Jun 21, 2013: Ahh... but how did you know "#" was the delimiter and not "."? ("#" is the correct answer in this case if you want valid CSV format based on the quoting, but what if you remove all the quotes?)

So what I'm ultimately pushing toward here, in order to nail down the actual requirement, is this: how do you determine which is the correct separator character in any given line of text, especially if that text contains two or more candidate special characters?

Once we have a precise requirement, the code may become self-evident. For instance, if the requirement is simply to break the text into separate fields using any of ".,;|#" as separators, you might use preg_split() with a simple character class as the separator. If the result must always be 3 fields, using only one separator chosen from a set of candidates, you might have to loop through that set (perhaps using foreach() on an array of separator characters) until you find one that gives you 3 fields (or return an error if you exhaust the candidates without getting 3 fields).
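
As a minimal sketch of the first option, assuming any one of . , ; | # may act as a separator and that spaces around it should be dropped (the sample string and variable names are made up for illustration):

```php
<?php
$text = 'jon# doe# 996655';

// Split on any single candidate separator, swallowing the spaces around it.
$fields = preg_split('/\s*[.,;|#]\s*/', $text);

print_r($fields);
```

Because "." is in the character class, the same pattern would also cut an email address or a decimal number into pieces, which is one reason the requirement needs pinning down first.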
@phantom007 (author), Jun 21, 2013: Hello NogDog

Thanks for your reply.

The idea is to take the very NEXT character (other than white space) after the email field and assume it is the delimiter. Here are the possible cases:

Case 1: Email is in the first column. We get the delimiter by taking the very next character after the email (other than white space).

Case 2: Email is in one of the middle columns. We get the delimiter by taking the very next character after the email (other than white space).

Case 3: Email is in the last column. We get the delimiter by taking the character before the email (other than white space).

Case 4: Email is the only column. No delimiter is required.

Case 5: Email does not exist. We show an error, since the email is mandatory here.

So the question is: how do we achieve this? With some regex patterns? What regex would that be?
@NogDog, Jun 21, 2013: This is looking promising, but you should add some more test cases, including negative tests, to see if it really works.

[code=php]
<?php

function getDelimiter($str, $debug = false)
{
    // Loose email pattern: starts and ends with a word character, one "@" in between.
    $email = '\w[^@\s]*@[^@\s]+\w';
    // Capture the nearest non-space character (or line start/end) on each side of the email.
    $regex = '/(^|\S)\s*'.$email.'\s*($|\S)/';
    if(preg_match($regex, $str, $matches)) {
        if($debug) {
            echo "<pre>".var_export($matches, true)."</pre>\n";
        }
        // The first single-character capture is taken to be the delimiter.
        foreach($matches as $match) {
            if(strlen($match) == 1) {
                return $match;
            }
        }
        return false;
    }
    else {
        if($debug) {
            echo "<p>Nope</p>\n";
        }
        return false;
    }
}

$data = array(
    '[email protected]# foo# bar',
    'foo #[email protected] #bar',
    'foo # bar # [email protected]'
);
foreach($data as $test) {
    $delimiter = getDelimiter($test, true);
    echo "<p>Delimiter for '$test' is '$delimiter'.</p>\n";
}
[/code]
@phantom007 (author), Jun 21, 2013: Thanks once again

Can I use the following regex for the email instead of the one in your code?

$email = '/([+a-zA-Z0-9._-]+@[a-zA-Z0-9._-]{2,}\.[a-zA-Z]{2,6})/i';

I am not sure whether it is more powerful than the one you used; it's just that yours is confusing to me.

Please suggest.

Thanks
@phantom007 (author), Jun 21, 2013: One more question: what if the delimiter is a non-UTF character? Is there any way to detect that?
@phantom007 (author), Jun 21, 2013: I also noticed a problem when there are no delimiters at all:

[CODE][email protected] John Mathew [/CODE]

The above returns J as the delimiter. Perhaps ignoring a-zA-Z0-9 would do the trick?

And for the following string, it returns T as the delimiter:

[CODE][email protected];CHARIOT Tichel;[/CODE]

Thanks
@NogDog, Jun 21, 2013: I'm starting to think the only viable solution -- outside of requiring the data source to use one specific delimiter, preferably with proper CSV formatting -- is to create a "white list" of allowed delimiters and test against each one until you get the correct number of fields.
[code=php]
$delims = array(',', ';', '|', '#');
$delimiter = null;
foreach($delims as $delim) {
    $parts = explode($delim, $text);
    if(count($parts) == 3) { // or whatever the correct value is
        $delimiter = $delim;
        break;
    }
}
[/code]
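
One way this loop could be packaged for reuse is sketched below. The function name, its parameters, and the choice of null for "nothing matched" are all assumptions made for illustration, not part of the original suggestion.

```php
<?php
// Hypothetical helper: returns the first candidate that splits $text into
// exactly $expectedFields fields, or null when no candidate does.
function detectDelimiterByFieldCount($text, $expectedFields, array $candidates = array(',', ';', '|', '#'))
{
    foreach ($candidates as $candidate) {
        if (count(explode($candidate, $text)) === $expectedFields) {
            return $candidate;
        }
    }
    return null;
}

var_dump(detectDelimiterByFieldCount('jon; doe; 996655', 3)); // the ";" candidate matches
var_dump(detectDelimiterByFieldCount('jon doe 996655', 3));   // no candidate matches
```

The order of the candidate list matters: the first delimiter that produces the expected count wins, even if a later one would also fit.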
@phantom007 (author), Jun 22, 2013: Hi

Thanks for the reply and for taking the trouble to code it, but I am not sure how to integrate that new code of yours.

BTW, here is another example:

$str = '000020;ACTIVE;AU VIEUX CAMPEUR;;48 RUE DES ECOLES; ;75005;PARIS;president JACQUES YVES DE RORTHAYS;AF14;M.;Jean-Jacques DENUAU;06 08 16 65 62;[email protected];F004;Melle;Magali SUREDA;06 86 48 23 30;[email protected];Melle;Anne-Charlotte MICHELET;[email protected]';

$a = getDelimiter($str);

echo $a; // returns 5, which is incorrect. It should return ;

I am not sure whether it is too hard to implement the following logic using regex.

Case 1: Get the next character (other than a-zA-Z0-9.\r\n\f and white space) after the first email address found in a given line.

Case 2: If no character is found in Case 1, get the previous character before the email (other than a-zA-Z0-9.\r\n\f and white space).

Case 3: If no characters are found in Case 1 or Case 2, return false.

Please help.
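
A rough sketch of these three cases, using string offsets rather than one large regex, might look like the following. The function name and the simplified email pattern are assumptions for illustration (it is not a full RFC 822 matcher), the sample address user@example.com is a placeholder, and "white space" is read here as the plain space character only, so a tab can still be returned as a delimiter.

```php
<?php
// Hypothetical helper following Cases 1-3 above.
function delimiterNearEmail($line)
{
    // Deliberately simple email pattern (an assumption, not RFC-complete).
    $email = '/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/';
    if (!preg_match($email, $line, $m, PREG_OFFSET_CAPTURE)) {
        return false; // no email address found at all
    }
    $start = $m[0][1];
    $end   = $start + strlen($m[0][0]);

    // Characters that can never be a delimiter, per the cases above.
    $notDelimiter = '/[A-Za-z0-9.\r\n\f]/';

    // Case 1: first character after the email, skipping plain spaces only.
    $after = ltrim(substr($line, $end), ' ');
    if ($after !== '' && !preg_match($notDelimiter, $after[0])) {
        return $after[0];
    }

    // Case 2: last character before the email, again skipping plain spaces.
    $before = rtrim(substr($line, 0, $start), ' ');
    if ($before !== '' && !preg_match($notDelimiter, substr($before, -1))) {
        return substr($before, -1);
    }

    // Case 3: nothing usable on either side.
    return false;
}

var_dump(delimiterNearEmail('jon# doe# 996655# user@example.com'));
var_dump(delimiterNearEmail('jon doe 996655 user@example.com'));
```

Only the first email address in the line is examined, as Case 1 asks.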
@NogDog, Jun 23, 2013:
[code=php]
<?php
/**
 * Try to figure out what the delimiter is by looking for an email address
 * @return string (false if not found)
 * @param string $str string to search
 * @param bool $debug whether to output debug info (default to false)
 */
function getDelimiter($str, $debug=false)
{
    // Captures: [1] = character before the email, [2] = character after it.
    static $regex = '/(^|\S)\s*[^()<>@,;:\\".\[\] \000-\031][^()<>@,;:\\"\[\] \000-\031]*@[^()<>@,;:\\"\[\] \000-\031]*[^()<>@,;:\\".\[\] \000-\031]+\s*(\S|$)/';
    if(preg_match($regex, $str, $matches)) {
        if($debug) {
            echo "<pre>Debug: found email:\n".var_export($matches, true)."</pre>\n";
        }
        // Accept the first capture that is a non-word character.
        for($ix=1; $ix<=2; $ix++) {
            if(!empty($matches[$ix]) and preg_match('/\W/', $matches[$ix])) {
                return $matches[$ix];
            }
        }
    }
    return false;
}

// test it:
$test = array(
    '000020;ACTIVE;AU VIEUX CAMPEUR;;48 RUE DES ECOLES; ;75005;PARIS;president JACQUES YVES DE RORTHAYS;AF14;M.;Jean-Jacques DENUAU;06 08 16 65 62;[email protected];F004;Melle;Magali SUREDA;06 86 48 23 30;[email protected];Melle;Anne-Charlotte MICHELET;[email protected]',
    '[email protected];CHARIOT Tichel',
    'jon# doe# 996655# [email protected]',
    'jon doe 996655 [email protected]'
);

foreach($test as $str) {
    echo "<pre>$str:\n";
    $result = getDelimiter($str, true);
    echo var_export($result, true)."</pre>\n";
}
[/code]
@phantom007 (author), Jun 23, 2013: Looks good so far, except that it returns false when the delimiter is a tab.

$test = array(
"jon doe [email protected] 996655"
);

Returns:

jon doe [email protected] 996655
false

I will be doing more tests and will come back to you if I find more issues.

Can you also please tell me what exactly the regex in your code is doing?

Thanks for your hard work
@NogDog, Jun 23, 2013: You want tabs, too? You said to skip white space, which normally includes tabs. If you tell me you also want underscores as delimiters, I may have to send somebody to rough you up a bit. :p
@phantom007 (author), Jun 23, 2013: Sorry if I bothered you, but I think it's normal for CSV files to contain tabs (\t) as delimiters, so that should be valid. When I said white space, I actually meant the space produced by the spacebar key on the keyboard.

Thanks, and sorry once again.

PS: Could you please explain what exactly your regex is doing?
@NogDog, Jun 23, 2013: Just to be pedantic, CSV files use commas as delimiters (that's what the "C" stands for), with pretty specific rules for things like using quotes.

I stole parts of what I used to identify the email address from here: http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html

The only parts you should have to mess around with are this at the beginning:

(^|\S)\s*

...and this at the end:

\s*(\S|$)

[B]\s[/B] is any white-space character (space, tab, newline, carriage return, vertical tab)

[B]\S[/B] is any character that is NOT a white-space character

To include tabs as delimiters, I believe you could change those parenthesized bits to include the tab character via [B]\t[/B]:

(^|[\S\t])\s*

([\S\t]|$)
@phantom007 (author), Jun 23, 2013: Hi

This is the regex I am using after the modification:

static $regex = '/(^|[\S\t])\s*[^()<>@,;:\\".\[\] \000-\031][^()<>@,;:\\"\[\] \000-\031]*@[^()<>@,;:\\"\[\] \000-\031]*[^()<>@,;:\\".\[\] \000-\031]+\s*([\S\t]|$)/';

But the function still returns false when I pass it tab-separated data:

abc [email][email protected][/email]

qrs [email][email protected][/email]
@phantom007 (author), Jun 23, 2013: Apart from the issue reported in my last post, could you also split your regex up into variables, so that if I later want to allow or disallow another character I can do it myself?

Thanks for the help
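
One possible way to split the pattern into named pieces is sketched below. All variable names are hypothetical, the backslashes follow the ex-parrot source linked earlier in the thread, and user@example.com is a placeholder address. The padding is left as \s* to match the regex as posted at this point in the thread.

```php
<?php
// Characters that may not appear anywhere in the address.
// Add or remove characters here to change what counts as part of an email.
$forbidden = '()<>@,;:\\"\[\] \000-\031';

$inner = '[^' . $forbidden . ']';   // allowed anywhere in the address
$edge  = '[^' . $forbidden . '.]';  // allowed at the start or end (no dot)
$email = $edge . $inner . '*@' . $inner . '*' . $edge . '+';

$delimiterChar = '[\S\t]';  // what may count as a delimiter
$padding       = '\s*';     // what may sit between the delimiter and the address

$regex = '/(^|' . $delimiterChar . ')' . $padding . $email
       . $padding . '(' . $delimiterChar . '|$)/';

preg_match($regex, 'jon;doe;user@example.com;996655', $matches);
var_dump($matches[1], $matches[2]);
```

With the pieces separated like this, changing which characters are skipped around the address only means editing $padding, and changing what is accepted as a delimiter only means editing $delimiterChar.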
@NogDog, Jun 23, 2013:
[code=php]
static $regex = '/(^|[\S\t]) *?[^()<>@,;:\\".\[\] \000-\031][^()<>@,;:\\"\[\] \000-\031]*@[^()<>@,;:\\"\[\] \000-\031]*[^()<>@,;:\\".\[\] \000-\031]+ *([\S\t]|$)/';
[/code]

However, this probably won't work as desired if you have an odd case where the separator is some character like ";" but you also have a tab character before/after it.
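
The effect of swapping \s* for a space-only quantifier can be seen in a small experiment. The email pattern below is a deliberately simplified stand-in for the long one (an assumption made to keep the example readable), and the sample lines and the address user@example.com are made up.

```php
<?php
$email = '[^@\s]+@[^@\s]+';          // simplified stand-in for the long pattern
$line  = "abc\tuser@example.com";    // tab-separated sample

// \s* version: the match starts at "c", and \s* quietly swallows the tab.
preg_match('/(^|[\S\t])\s*' . $email . '/', $line, $withWhitespace);

// Space-only version: the tab cannot be skipped, so it is what gets captured.
preg_match('/(^|[\S\t]) *?' . $email . '/', $line, $withSpaces);

// The odd case mentioned above: ";" is the real separator, followed by a tab.
preg_match('/(^|[\S\t]) *?' . $email . '/', "abc;\tuser@example.com", $mixed);

var_dump($withWhitespace[1]); // the letter "c"
var_dump($withSpaces[1]);     // a tab character
var_dump($mixed[1]);          // also a tab character, not ";"
```

The first capture shows why the earlier version returned false for tab-separated data: "c" is a word character, so the \W check rejects it.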
@phantom007 (author), Jun 24, 2013: OK, I have updated the code, and it still does not report the tab as the delimiter.

Here is the code; attached is the CSV I am testing with.

[CODE]
function getDelimiter($str, $debug = false)
{
    static $regex = '/(^|[\S\t]) *?[^()<>@,;:\\".\[\] \000-\031][^()<>@,;:\\"\[\] \000-\031]*@[^()<>@,;:\\"\[\] \000-\031]*[^()<>@,;:\\".\[\] \000-\031]+ *([\S\t]|$)/';

    if(preg_match($regex, $str, $matches)) {
        if($debug) {
            echo "<pre>Debug: found email:\n".var_export($matches, true)."</pre>\n";
        }
        for($ix=1; $ix<=2; $ix++) {
            if(!empty($matches[$ix]) and preg_match('/\W/', $matches[$ix])) {
                $delimiter = $matches[$ix];
                // Report a tab (ASCII 9) by name so it is visible in output.
                $delimiter = ord($delimiter)==9 ? 'TAB' : $delimiter;
                return $delimiter;
            }
        }
    }

    return false;
}
[/CODE]

Thanks
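
One thing that might be worth ruling out, offered only as a guess since the test file is not available here, is that the data contains runs of spaces rather than real tab characters (editors and copy/paste often convert them). A small hypothetical helper that makes invisible characters visible could be used to check a line before passing it to getDelimiter():

```php
<?php
// Hypothetical debugging helper: replaces tab, CR and LF with visible escapes.
function showInvisible($line)
{
    return strtr($line, array("\t" => '\t', "\r" => '\r', "\n" => '\n'));
}

// Placeholder sample line containing a real tab and a Windows line ending.
echo showInvisible("abc\tuser@example.com\r\n"), "\n";
```

If no \t shows up in the output for a line from the file, the function never had a tab to find.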

[canned-message]attachments-removed-during-migration[/canned-message]