
[RESOLVED] Workaround for the 30-second execution limit

Hey guys,

I need a bit of help.

Without changing php.ini to increase execution time, is there any way I can get around it? I don’t mind if I have to click next each time to do it…

Basically I'm building a new search engine for our company's website. I've written a crawler script that goes through each page and logs every word and its occurrences. This takes a lot longer than 30 seconds given the number of pages we have.

The pages are read in from our cached copies, which are static HTML, and then logged.

Here’s a small example of the setup I use to read the files

[code=php]
if (file_exists($filename)) {
    if ($handle = opendir($filename)) {
        while (false !== ($file = readdir($handle))) {
            ############## Here is where I open the file and do all the stuff with the text line by line.
        }
        closedir($handle);
    }
}
[/code]

Can anyone think of a method of interrupting this, say, every 5 pages so that I have to click "Next" to proceed? Unless someone can think of another method. I don't really want to change php.ini, as a long-running crawl could bring the server to a halt.

It’s our own server, but I don’t want to take the risk really.

PHP

19 Comments

@SyCoAug 13.2008 — Set a start time at the top of your script, call your crawl (open url) function recursively and add a time check to it. If the time is getting close to your max_exec time show a button and halt the crawl.
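
Roughly like this, loop shape aside (an untested sketch: I've used a flat loop over a list of pages rather than recursion, and $pages, $resume_from, crawl_page() and crawl.php?resume= are placeholders for your own setup; the 25-second cutoff is just an example):

[code=php]
$start = microtime(true);      // note when the script started
$limit = 25;                   // stop a few seconds short of max_execution_time

foreach ($pages as $p => $file) {
    if ($p < $resume_from) {
        continue;              // skip pages already crawled on a previous run
    }

    crawl_page($file);         // your existing open-the-file-and-index-it code

    if ((microtime(true) - $start) > $limit) {
        // running out of time: show a link that resumes from the next page
        echo "<a href='crawl.php?resume=" . ($p + 1) . "'>Next</a>";
        exit;
    }
}
[/code]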
@Kostas_ZotosAug 13.2008 — Hi,

Maybe this (change the time limit per script):

[ In the first lines of your php script (or anywhere, but before timeout) set the max time (in seconds) ]
[code=php]set_time_limit(60); // Set the max execution time allowed for the script (60 seconds -the default is usually 30 secs- ). [/code]

Kostas
@rootAug 13.2008 — Have you tried using PHP from the command line?
@maverukauthorAug 14.2008 — root, our office server prevents us from connecting to our web server via the command line. I've raised the issue with our IT company, but they're unwilling to lift a finger to help in any way. It took them more than a day just to open the port to view the administration page of our server.

Kostas, changing the execution time isn't something I want to do. I have no idea how long a full crawl will take, and I don't want to bring the server to its knees waiting for it to finish, which is why I'm opting to do it in steps. That way I can take a break every 30 seconds instead of bogging down the server continually for minutes, maybe more, for each crawl of our 4 websites.

SyCo, how would I resume the process after halting the script?

To be honest I was going to use a counter to count 5 pages and then halt, so that I don't end up stopping mid-page, but I need to think of a way to resume from page 6.

Except 6 will be a file name like index-option=news&id=5 and 7 will be index-option=products=1 for example.

They're not numbered but they are executed in the same order each time, which helps.

$filename is the folder I'm searching.

$file is a filename inside the folder I'm searching.

A typical list of files, so you get the picture:

.
..
index-option=news&id=1.html
index-option=news&id=2.html
index-option=news&page=2.html
index-option=products&category=1.html
index-option=static&id=1.html
index-option=stockists&page=3.html

I use validation to remove unwanted pages such as . and .. since they're not pages, and then other validation to remove pages I don't want to index at all.

The cache pages kick in when the database is offline or turned off to reduce bandwidth consumption.

The result is that they help me build a crawl index since the content will always be the same until a new cached version of the page is created.
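
The validation itself is nothing fancy, roughly along these lines inside the readdir() loop (simplified; the real exclusion list is longer):

[code=php]
// skip the directory entries that aren't pages
if ($file == "." || $file == "..") {
    continue;
}

// skip pages I don't want indexed at all (one example pattern)
if (preg_match("/index-option=search/", $file)) {
    continue;
}
[/code]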
@maverukauthorAug 14.2008 — I've changed my approach... I've done this.

[code=php]
$pages = array();   // collect every file name so each one gets a numeric index

if (file_exists($directory)) {
    if ($handle = opendir($directory)) {
        while (false !== ($file = readdir($handle))) {
            array_push($pages, $file);
        }
        closedir($handle);
    }
}

foreach ($pages as $p => $f) {
    echo $p . " : " . $f . "<br />";
}
[/code]


That numbers everything for me, so the next time I can start at a specific point.
@maverukauthorAug 14.2008 — That worked perfectly, if a bit oddly.

[code=php]
$counter = ($page_id + $page_limit);

foreach ($pages as $p => $file) {
    if ($p <= $counter && $p >= ($page_id - ($page_limit - 1))) {
        ############### INSERT words, etc... ###########
    }

    if ($p == $counter && $counter < count($pages)) {
        echo "<a href='index.php?option=search&amp;task=crawl&amp;page=" . ($p + $page_limit) . "'>Next</a>";
    }
}
[/code]

That executes and is perfect. It does 0-5, then 6-15, then 16-25. A bit odd, but 10 pages per step is plenty and it executes in no time at all, so there's no need to fix what isn't broken.

However, this bit IS broken!

Notice: Undefined offset: 9 in C:\www\vhosts\localhost\components\crawl.php on line 106

Notice: Undefined offset: 2 in C:\www\vhosts\localhost\components\crawl.php on line 105

Now, lines 105 and 106 are the part that breaks the text up into words.

[code=php]
for( $i = 0; $words[$i]; $i++ ){
for( $j = 0; $words[$i][$j]; $j++ ){
######### INSERT WORD, INSERT OCCURRENCE
}
}
[/code]


This part was taken from an online example for a simple search engine. I've heavily adapted it to suit my own website's needs, but the error is in the part that I didn't code.

I've contacted the author to see if he has any insight into the issue, but I have yet to receive a response.

http://www.devarticles.com/c/a/HTML/Building-A-Search-Engine/

It works regardless, but the notices are just plain annoying since they appear for virtually every line, with different offsets each time.
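
I suspect the notices come from the loop condition itself: $words[$i] gets evaluated for an index one past the end of the array, which is exactly when the loop is supposed to stop. Bounding both loops with count() should quiet them without changing the behaviour; something like this (untested):

[code=php]
// bound both loops with count() so the condition never reads past the end of the array
for ($i = 0, $ni = count($words); $i < $ni; $i++) {
    for ($j = 0, $nj = count($words[$i]); $j < $nj; $j++) {
        ######### INSERT WORD, INSERT OCCURRENCE
    }
}
[/code]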
@maverukauthorAug 14.2008 — I wish the edit time was longer... I hate having to post multiple times.

Anyway, there was an error with the counter: it skipped 108 and 88 for some reason, so I recoded it to work properly.

[code=php]
$start_page = $page_id * $page_limit;
$end_page = ($page_id + 1) * $page_limit - 1;
$next_page = $page_id + 1;

echo $start_page . " START PAGES . " . $end_page . " END PAGES <br /><br />";

foreach ($pages as $p => $file) {

    if ($p >= $start_page && $p <= $end_page) {
        echo $p . " : " . $file . "<br />";
    }

    if ($p == $end_page && $p < count($pages)) {
        echo "<a href='index.php?option=search&amp;task=crawl&amp;page=" . $next_page . "'>Next</a>";
    }
}
[/code]
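
One small caveat with the Next link: inside the loop $p never reaches count($pages), so the $p < count($pages) test is always true and the link would still appear when the page count divides exactly by $page_limit. Comparing the end of the range against the last index avoids that (untested variant):

[code=php]
// only show the link when there is genuinely another chunk after this one
if ($p == $end_page && $end_page < count($pages) - 1) {
    echo "<a href='index.php?option=search&amp;task=crawl&amp;page=" . $next_page . "'>Next</a>";
}
[/code]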


Still getting the other error, but I've suppressed it with @.


EDIT (finally): another problem:

I want to exclude pages that contain name=something

An example of a filename is index-option=stockists&county=&name=bristol&x=0&y=0&.html where name=bristol

I want to exclude it if name has a value, but if it has no value, like index-option=stockists&county=&name=&az=n&.html, then I want to keep it for crawling.

To exclude other pages I've done this:

if( !preg_match("/index-option=search/", $file) ) {}

This time, I need it to detect when there is something in name= which isn't & or nothing, so it would be like... !preg_match("/name=(?dontknow?)&/",$file)

What could I put that would say if it contains something? I'm probably going about it the wrong way completely.

Basically I want to allow files that contain name=&, but disallow files that contain name=SOMETHING&

I tried /name=(.*)&/ but that eliminates everything containing name= too.
@maverukauthorAug 14.2008 — Sorted.. && !preg_match("/name=([a-zA-Z0-9/.,_~]+)&/",$file)
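
For the record, a negated character class does the same job without listing every allowed character, and it also explains why /name=(.*)&/ failed: .* happily matches an empty value too. Assuming a value never contains another &, this should behave the same as the pattern above:

[code=php]
// keep the file only when name= has no value (anything up to the next & counts as a value)
if ( !preg_match("/name=[^&]+&/", $file) ) {
    // keep this file for crawling
}
[/code]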
@rootAug 14.2008 — If you're on a machine on the same LAN or network, the net admin or head admin should have no objection to adding you as a trusted user with access to those resources. By the sound of it, it's your task to deliver a project and it's the net admin who is hindering your progress.

A little word in the right ear will soon have the net admin cooperating.

Network admins have this god complex and see themselves as above everyone in the company, yet they forget who pays their wages... put your argument, and why you need the access, in the right person's ear and they will have a word with the net admin.

Your alternative would be to let your deadline pass and push all the blame onto the net admin, who will be roasted at the next company BBQ and have to answer questions about why he ignored your requests. Sort of puts them on the spot.
@maverukauthorAug 14.2008 — If your a machine on the same lan or network, the net admin or head admin should have no objections to adding you to a trust or as a trusted user to have access to those resources, it is by the sound of it your task to deliver a project and it is the net admin that is hindering your progress.

A little word in the right ear will soon have the net admin cooperating.

Network adimns have this god complex and see themselves as above everyone in that company yet they forget who pays their wages... put your argument in the right persons ears and why you need the access and they will have a word with the net admin.

Your alternative would be to have your deadline pass and push all the blame on the net admin who will be roasted at the next company BBQ and they will have to answer questions as to why he ignored your requests. Sort of puts them on the spot.[/QUOTE]


Unfortunately that's not the case.

Our web server is hosted externally, and our network administrator is a third-party company we pay to look after our systems (why, I don't know; they barely do a thing).

I even asked them for a copy of our SLA not too long ago and they still haven't provided one.
@rootAug 14.2008 — Unfortunately that's not the case. Our web server is hosted externally, and our network administrator is a third-party company we pay to look after our systems...[/QUOTE]


That is a bit concerning, as you have requested the terms and conditions of the service level agreement, which they are legally required to provide on demand.

If your web server is hosted externally, this should not make too much difference as long as your company has the passwords to the user CP; if they do not, then they need to get that information. Regarding your network admin... it sounds like he is a freelancer who works from home or the golf course.

Your company, depending on how many PCs it operates, should have on site at least one full-time administrator and possibly two assistants to help with system configuration, maintenance and resolving user problems. I am betting that the cost to the company would be less than your "when he can be bothered" administrator.

I know it won't carry much weight, but I (having been a net admin) am shocked at your company for handing the keys to another company. What happens if this guy's company folds, or he is killed in a car accident, or his computers are stolen? What then?

Sounds like your company needs a large cup of not-so-latte and to wake up to the fact that they have a hidden security issue they need to address, not only from the standpoint of protecting the company but also from the point of view of productivity for the end users (you).
@maverukauthorAug 14.2008 — I set up the web server independently of them; they're just blocking the ports I need to use the command line on the remote server.

This will shock you further. I'd have to agree with that freelancer statement. It really does feel like that.

Our company has approximately 25 computers and no administrator.

You are right though, I've said since day one of working here that this company has us by the balls in such a respect that it would cripple the business if we left them.

Our servers (except web server) are through them, our internet is through them, they look after all of our software licenses and have remote access to all computers on the network except mine (I disabled it).

Their company isn't that big, but it's not a one man job. If one of them died, another would take their place no doubt.

We've said on many occasions that we need to address the issue of our support from them. Since I started working here, I guess you could say that I'm the unofficial administrator, because they certainly aren't.

Last week we had a virus (because they didn't install the network licenses for our anti-virus that we paid for A YEAR AGO) and their solution was to buy a new PC because they couldn't get remote access.

Forget about the Windows XP recovery disc sitting on top of the tower; they wanted a new PC purchased through them and then payment to install all the software. A few hours go by and I remove the virus, but if anything it's prompted several responses: one being that we're paying for anti-virus we're not getting, another being who the hell is looking after our computers, because they certainly aren't.

Next week is the end of my probation period at the company, so I'm hoping I can discuss my role as the IT administrator here, including a big pay increase, since that's not what I was hired for.


Back on topic though: I'm trying to make a search keyword blacklist, a list of common words that I don't want included in the index.

[code=php]
if (strlen($cur_word) >= $smallest || !isset($common[$cur_word])) {

[/code]


$cur_word is the word that's going to be entered.

What I want it to do is exclude words that are under a certain length or that are common.

[code=php]
## in a different file
define("common", array("but", "and", "maver"));

## in crawler
$common = common;
[/code]


The problem is obvious, but I can't think of a solution.. Coders block today I'm affraid.
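
One note on the condition itself: to "exclude words under a certain length or if they are common", a word has to clear both tests to be kept, so the check probably wants && rather than || (a sketch, assuming $common ends up as an array keyed by word):

[code=php]
// keep the word only if it is long enough AND not on the common-word list
if (strlen($cur_word) >= $smallest && !isset($common[$cur_word])) {
    ######### INSERT WORD, INSERT OCCURRENCE
}
[/code]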
@maverukauthorAug 14.2008 — As is spelling block.
@SyCoAug 14.2008 — Our servers (except web server) are through them, our internet is through them, they look after all of our software licenses and have remote access to all computers on the network except mine (I disabled it).[/QUOTE]

Wow!

The Bobs will only understand numbers; that's why it was farmed out in the first place. Before that pay grade meeting I would suggest calculating the cost of setting up an internal server room and employing someone to admin it or assist you. Balance that against the current cost of the hosting, which will be a small drop and is all the bean counters see. Then factor in the man-hour rates to fix and chase every issue and, for a little FUD, disaster recovery. Put a number on the risk of the current admin stealing and selling data, lost customer confidence, and the marketability of the increased security of an in-house solution, etc. If you get creative you should be able to make a 5-year plan for an internal server room look like the best option from a business-plan point of view and financially too. I'm sure we could all come up with a few more ideas if you wanted to make a post about it. If nothing else you'll look like a serious, security- and business-minded, forward-thinking, valuable member of staff, which can only help with the pay grade talks.
@maverukauthorAug 14.2008 — Wow! The Bobs will only understand numbers; that's why it was farmed out in the first place. Before that pay grade meeting I would suggest calculating the cost of setting up an internal server room and employing someone to admin it or assist you...[/QUOTE]


I doubt they'd go for it since we have an internal server room already: the storage cupboard. In general, hiring someone dedicated to looking after the computers is a no-no for them since they're already paying another company to do that. Why pay £30k or more a year for an IT admin when another company will do it for significantly less?

I think my colleagues are beginning to take me for granted now, since they come to me with computer problems instead, which only saves us money since I get paid regardless. Still, if I put together a computer maintenance schedule (something we don't have; not a single PC has been defragged or updated since they were first purchased) I'm sure I can prove my worth for it.

At the end of the day, nobody noticed that they hadn't installed the anti-virus licenses until there was actually a problem a year later. I guess if a problem isn't brought to their attention, it's not a problem to worry about.

Without an IT administrator, how would they know any different?

PC slow? Get more RAM or a new PC... problem solved, everyone is happy.

I shouldn't need to prove my worth any more, I've done it enough. Thinking on my feet, I should only really need to maintain the computers once every 3 months, every 6 for the less troublesome ones. Preemptive solutions can be included, of course.

Anyway, going back, can anyone see anything wrong with that code?

I think it might be the constant common.

Doing it like so... $common = array("but" => "but", "and" => "and", "maver" => "maver"); works

but not by using constant.
@maverukauthorAug 14.2008 — Edit: Case sensitive. All working.
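
In short: define() can only hold scalar values in this version of PHP, so the common-word list has to live in a plain array (or an include that returns one), and the lookup needs lower-casing to dodge the case sensitivity. A minimal sketch of that shape:

[code=php]
// plain keyed array instead of a constant; define() can't store an array in PHP 5
$common = array("but" => "but", "and" => "and", "maver" => "maver");

// lower-case before the lookup so "And" and "and" are treated the same
if (strlen($cur_word) >= $smallest && !isset($common[strtolower($cur_word)])) {
    ######### INSERT WORD, INSERT OCCURRENCE
}
[/code]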
@rootAug 14.2008 — This is the issue with companies that do not understand the implications of farming out the services they rely on to operate to a third-party company.

Eggs in one basket.

It will take a catastrophic failure of the current network to wake your company up to the fact that it is very vulnerable.

If this admin company got a virus, it goes without saying that your network of computers would be at risk, and you already mentioned that your company had a virus on the machine they replaced with a new system because they couldn't connect to your network.

This tells me that the company has neglected to secure the network, and because they neglected to install the licences for the AV software, your network could be spewing personal data about the company and all its clients all over the world.

Bean counters are not qualified to make decisions about PC networks; if they were, they would have the appropriate credentials to prove they knew WTF they were doing.

What I cannot believe is the cavalier way in which the bean counters are putting everyone, including the customers, at risk under the justification of cost or cost effectiveness.

Another note: you mentioned that this admin company sent another PC which your company paid for... Why? Your company is paying for administration, not hardware. If the company doing your administration was on the ball, they would have sent a trained engineer to fix the issue quickly; the new computer was a cheapskate way of dealing with it.

I take it the machine that had the virus was the main uplink or domain controller; if that's the case, it says to me that the admin company got hacked or infected with a virus.

I would certainly cast doubt on the admin company's ability to do its job effectively.

$30k a year for a network admin is peanuts when you compare it to the loss of a machine in the network.

I could reel off a long list of issues with using an external administrator. It's fine for a huge corporation: when I worked at Lear they had the network admin in the USA as well as a local administrator who looked after the network locally.

As you're a small company, I am wondering why they are using an external company to do a job that should be done in house.

Let's assume you're currently paid peanuts for your efforts, at around $15k. If they were to take you on full time, they should be giving you a raise to something like $20k. All they need to do is invest another $10k and they have an in-house administrator. So what is their objection?

I could go on about how the company has dropped its nads in the vice of another company which, IMHO, is not providing adequate service or the flexibility to do your jobs. It should be a case of your company dictating its needs and not the other way around; this company sounds like it is biting the hand that feeds it and getting away with it.
@SyCoAug 14.2008 — A big fat layer of FUD regarding the cost of disaster recovery would get the attention of the bean counters.

Q. "So Mr Bean Counter, how do you put a cash value on the damage to company reputation after the network is hacked and the company's confidential records exposed?"

Honestly if you bring all this to their attention and they still don't get it, I would start making inquiries about working somewhere else. Hell, I'd do that before going into the pay neg anyway.

Good luck with that, and don't undervalue yourself!
@maverukauthorAug 15.2008 — Said computer was actually just a sales office computer. The proposed downtime from the third-party company was several days to get new hardware and for them to come out and install it (costing £1000s no doubt). I fixed it in a day without issue.

You're absolutely right though, and when it comes to my pay evaluation I'm going to ask them to increase my pay to reflect my extra duties, because they're getting me to do the work of this third-party company now for no additional cost. I was only hired to do the website, and even that was done externally by another company before I was hired. In the end they needed someone full time, so perhaps I can use that as persuasion.

At the end of the day I am underpaid for what I currently do, but the trade-off was that it was a local job and I had been out of work, so I was more than happy to take it. Regardless of whether I was going to be IT admin here, I was hoping for a pay increase.