Maximum elapsed time exceeded

Message boards : Number crunching : Maximum elapsed time exceeded

DJStarfox
Send message
Joined: Nov 26 07
Posts: 13
Credit: 144,924
RAC: 46

Stderr output (core client 6.10.45):

Maximum elapsed time exceeded
Warning: will use heterogeneity workaround.


What is going on? Since my BOINC client downloaded the newest science app yesterday, I have nothing but errors.
https://malariacontrol.net/result.php?resultid=62988924

Profile Conan
Avatar
Send message
Joined: Mar 24 09
Posts: 14
Credit: 112,965
RAC: 76

All my Windows work units from last night and today are getting this error; all were OK yesterday.

Linux work units are processing OK (they just give very low credit, under 5 per WU).

Conan
____________

Profile Conan
Avatar
Send message
Joined: Mar 24 09
Posts: 14
Credit: 112,965
RAC: 76

All Windows work units are giving this "Maximum elapsed time exceeded" error; it has been this way for 2 days now.
____________

John.Sawyer
Send message
Joined: Sep 12 09
Posts: 1
Credit: 99,792
RAC: 0

Same here

As well as getting the elapsed time message, I just had 40K credits knocked off.
See Guillaume Gnaegi's post here: https://malariacontrol.net/forum_thread.php?id=1144#14654 for an explanation of the credit problem.

I don't want to waste ticks, so I've suspended Malaria Control task pickup until this problem has been resolved.
____________

Profile Conan
Avatar
Send message
Joined: Mar 24 09
Posts: 14
Credit: 112,965
RAC: 76

Does anyone know if this issue is fixed?

I am not going to reconnect my Windows computers until I know it is all working, given the number of failed work units I have sent back over the last two days.

Conan
____________

John Clark
Avatar
Send message
Joined: Feb 10 08
Posts: 2149
Credit: 1,193,234
RAC: 1,472

I seem to be sending back a large number of computationally failed WUs just lately. In fact the majority on this host, despite the fact it was working OK until the correction of the high credit per WU issue.

What is happening?
____________
Go away, I was asleep

Said a Russell, 3 Shih-Tzus & a Bischeon Frize

Profile Conan
Avatar
Send message
Joined: Mar 24 09
Posts: 14
Credit: 112,965
RAC: 76

Now that you mention it, that is also about the time I started to get the problems.
My first few work units for the project under Windows ran fine; now they don't.
So it could be an issue with whatever the project did to remove credit, and it has affected the Windows work units.

Conan
____________

Profile Saenger
Avatar
Send message
Joined: Mar 8 06
Posts: 55
Credit: 143,384
RAC: 28

I seem to be sending back a large number of computationally failed WUs just lately. In fact the majority on this host, despite the fact it was working OK until the correction of the high credit per WU issue.

What is happening?

That seems to be the same error that made a lot of my WUs crash before and during the credit disaster. It's probably unrelated, though, as the credits are granted on the server and this happens on the distributed machines.

The issue doesn't seem to be fixed, though I have received some tasks that took more than one hour and didn't get the error.
____________
Grüße vom Sänger

Thyme Lawn
Send message
Joined: Jun 20 06
Posts: 181
Credit: 1,233,724
RAC: 1,389

The formula to work out the maximum allowed elapsed time uses the following XML tags from client_state.xml:


  • <duration_correction_factor> from the <project> section for MCDN.

  • <flops> from the openMalariaA v6.51 <app_version> section.

  • <rsc_fpops_bound> from the <workunit> section (fixed at 150000000000000.000000).
    For the estimated elapsed time <rsc_fpops_est> is used instead (now back to being fixed at 20000000000000.000000).


The value for <flops> depends on your benchmark and will be different on each system (for openMalariaA v6.51 it's 1403434822.298770 on my C2Q Q6600 XP system and 2142322887.568957 on my C2D T7300 Vista system).

The maximum allowed elapsed time is calculated using the formula

<rsc_fpops_bound> / <flops> * <duration_correction_factor>


Taking a task queued up on my Q6600 as an example, the formula for maximum elapsed time is

150000000000000 / 1403434822.298770 * 0.543255 = 58063.44 seconds (16:07:43)


and for estimated elapsed time it's

20000000000000 / 1403434822.298770 * 0.543255 = 7741.79 seconds (2:09:02)


(DCF should be smaller for openMalaria branch A because most of those tasks are currently completing in between 30 and 90 minutes, but branch B tasks are taking longer than their estimated time and pushing DCF back up.)
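
To make the arithmetic concrete, here is a minimal Python sketch of those two calculations, using the Q6600 values quoted above (the function name is just illustrative):

# Elapsed-time limits derived from client_state.xml values, as described above.
def elapsed_limit(rsc_fpops, flops, dcf):
    # seconds allowed (or estimated) = FPOPS budget / projected speed * DCF
    return rsc_fpops / flops * dcf

FLOPS_Q6600 = 1403434822.298770   # <flops> for openMalariaA v6.51 on the Q6600
DCF = 0.543255                    # <duration_correction_factor> for MCDN

print(elapsed_limit(150000000000000, FLOPS_Q6600, DCF))   # <rsc_fpops_bound>: ~58063.44 s
print(elapsed_limit(20000000000000, FLOPS_Q6600, DCF))    # <rsc_fpops_est>:   ~7741.79 s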
____________
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer

Profile OlaV_Ouafouaf
Send message
Joined: Jun 20 06
Posts: 5
Credit: 202,527
RAC: 32

All Windows work units are giving this "Maximum elapsed time exceeded" error; it has been this way for 2 days now.


I too ...
____________

Mark Woodruff
Send message
Joined: Nov 14 10
Posts: 1
Credit: 742,062
RAC: 426

Getting similar errors here too... 1417 seconds seems pretty low as the maximum elapsed time. Once a job hits 1417 seconds it aborts.

Gonna suspend because there's no point in running them if they're consistently aborting - and they are.

FYI: Windows 7 Ultimate 64-bit on an AMD Phenom 9750 with 8 GB of RAM.

I'll check in daily to see if it's been resolved.

Mark

DJStarfox
Send message
Joined: Nov 26 07
Posts: 13
Credit: 144,924
RAC: 46

Uhh... I'm running Linux x86_64, but I'm still getting this error.

BobCat13
Send message
Joined: Jan 4 07
Posts: 6
Credit: 153,366
RAC: 105

I may have found the problem.

Here are some numbers from my Athlon X2 6000+, pulling the <flops> values from the <app_version> sections in client_state.xml:

benchmark      3033420905.978824
minirosetta    3066968503.937008
simap          2840809069.112005
poem           3023016031.503862
openMalariaA  50069545112.719414

Notice how the first three applications are similar to the benchmark, but openMalariaA is almost 17 times as high. Using the calculation listed by Thyme Lawn but dropping the DCF portion, I get:

150000000000000 / 50069545112.719414 = 2995.833 sec

which matches the elapsed time when the task errored (49:55 = 2995 sec).

The new runtime estimation does not use DCF, as per this link:

http://boinc.berkeley.edu/trac/wiki/RuntimeEstimation

The question is: why does openMalariaA have such a high number in <flops>?
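
For anyone who wants to check their own host, here is a rough sketch that lists the per-app <flops> values from client_state.xml and the DCF-free limit each one implies (it assumes the file parses as plain XML and uses the 1.5e14 bound from earlier in the thread):

# List <flops> per app_version in client_state.xml and the elapsed-time limit
# each one implies for a 1.5e14 rsc_fpops_bound, ignoring DCF.
import xml.etree.ElementTree as ET

RSC_FPOPS_BOUND = 150000000000000.0

root = ET.parse("client_state.xml").getroot()
for av in root.iter("app_version"):
    app = av.findtext("app_name")
    flops_text = av.findtext("flops")
    if not app or not flops_text:
        continue
    flops = float(flops_text)
    print(f"{app:15s} flops={flops:22.6f}  limit={RSC_FPOPS_BOUND / flops:10.1f} s")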

Thyme Lawn
Send message
Joined: Jun 20 06
Posts: 181
Credit: 1,233,724
RAC: 1,389

The new runtime estimation does not use DCF

But only if you're running an alpha test version of BOINC (6.11.* or 6.12.*). Older versions still use DCF.

The question is: why does openMalariaA have such a high number in <flops>?

That confirms my suspicion that something is wrong in the host_app_version table records for the hosts which are suffering from the maximum elapsed time problem. I'm not sure what could have caused that to happen, but it can only be fixed on the server.

Users who are running an alpha test version of BOINC will have to wait until the hav records are modified, but those who are running 6.10.* or earlier could try increasing the DCF for MCDN in client_state.xml.
____________
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer

Profile OlaV_Ouafouaf
Send message
Joined: Jun 20 06
Posts: 5
Credit: 202,527
RAC: 32

As explained in BobCat13's post above (the inflated <flops> value for openMalariaA in client_state.xml), I have found a workaround:



I manually reduced the "flops" value:
- Stop BOINC.
- Edit client_state.xml AND client_state_prev.xml with a text editor and put in a much lower value (in my case I divided it by 4).
- Save and restart BOINC.

It has been working for two days now.
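
For anyone who prefers to script the same edit rather than do it by hand, here is a rough sketch under the same assumptions (BOINC stopped first, openMalariaA is the affected app, and <flops> divided by 4); it keeps a backup of each file:

# Sketch of the manual workaround above: divide the openMalariaA <flops> value
# by 4 in client_state.xml and client_state_prev.xml. Stop the BOINC client first.
import re, shutil

DIVISOR = 4.0
APP = "openMalariaA"   # assumed name of the affected app_version

def shrink(match):
    return "<flops>%.6f</flops>" % (float(match.group(1)) / DIVISOR)

def patch_block(block_match):
    block = block_match.group(0)
    if "<app_name>%s</app_name>" % APP not in block:
        return block
    return re.sub(r"<flops>([\d.eE+-]+)</flops>", shrink, block)

for path in ("client_state.xml", "client_state_prev.xml"):
    shutil.copy(path, path + ".bak")      # keep a backup before touching anything
    text = open(path).read()
    text = re.sub(r"<app_version>.*?</app_version>", patch_block, text, flags=re.S)
    open(path, "w").write(text)

Bear in mind the scheduler can refresh the value again on a later contact, so the edit may need repeating until the server-side fix lands.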

____________

P . P . L .
Avatar
Send message
Joined: Aug 27 08
Posts: 56
Credit: 500,976
RAC: 0

Hi.

I just had this happen: 2 tasks ran for 3 hrs 34 min and then errored out, a waste of time methinks.

Both had got to > 90% done. You might want to increase the time limit to ~4 hrs maybe, so they can finish and I can get the credits too :)

Wed 02 Feb 2011 17:53:13 EST malariacontrol.net Aborting task wu_899_517_252474_0_1296534263_0: exceeded elapsed time limit 12863.775025
Wed 02 Feb 2011 17:53:22 EST malariacontrol.net Computation for task wu_899_517_252474_0_1296534263_0 finished
Wed 02 Feb 2011 17:53:22 EST malariacontrol.net Output file wu_899_517_252474_0_1296534263_0_0 for task wu_899_517_252474_0_1296534263_0 absent

============================================================================

Wed 02 Feb 2011 18:06:07 EST malariacontrol.net Aborting task wu_899_516_252474_0_1296534263_0: exceeded elapsed time limit 12863.775025
Wed 02 Feb 2011 18:06:08 EST malariacontrol.net Computation for task wu_899_516_252474_0_1296534263_0 finished
Wed 02 Feb 2011 18:06:08 EST malariacontrol.net Output file wu_899_516_252474_0_1296534263_0_0 for task wu_899_516_252474_0_1296534263_0 absent
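
Incidentally, the limit printed in those messages can be turned back into the per-app speed the scheduler must have assumed, since (as worked out earlier in the thread) the limit is roughly rsc_fpops_bound divided by the <flops> value. A quick sketch, assuming the 1.5e14 bound Thyme Lawn quoted still applied to these tasks:

# Back out the projected speed behind "exceeded elapsed time limit 12863.775025":
# limit_seconds ≈ rsc_fpops_bound / flops, so flops ≈ rsc_fpops_bound / limit_seconds.
RSC_FPOPS_BOUND = 150000000000000.0   # assumption: the 1.5e14 bound quoted earlier
limit_seconds = 12863.775025          # from the log lines above

implied_flops = RSC_FPOPS_BOUND / limit_seconds
print(f"projected speed behind the limit: {implied_flops:.3e} FLOPS")   # ~1.166e10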
____________

michaelT
Volunteer moderator
Project administrator
Project developer
Project scientist
Send message
Joined: Jul 20 10
Posts: 47
Credit: 16,359
RAC: 0

Hi P.P.L.,

The problem is that the CPU time limit was exceeded (the rsc_fpops_bound BOINC parameter: a limit on FLOPS, after which the job is aborted); it was too low for these jobs.
In fact, with the results coming back from openMalariaB, the input parameters for openMalariaB have evolved in a way that a small number of jobs now take more CPU time than before.

We have now increased the rsc_fpops_bound parameter so that newly generated jobs will not reach the limit, and you can finish the jobs and get your credits :)

P . P . L .
Avatar
Send message
Joined: Aug 27 08
Posts: 56
Credit: 500,976
RAC: 0

Hi michaelT.

We have now increased the rsc_fpops_bound parameter so that newly generated jobs will not reach the limit, and you can finish the jobs and get your credits :)

Verrrry gooood. ;)

____________

BOINCstats.SOFA
Send message
Joined: Feb 2 11
Posts: 1
Credit: 3,632
RAC: 0

This problem still exists also with newer tasks and is happening hundreds of times. I've seen tasks with those errors after 9 hours runtime :(

Some examples:
https://malariacontrol.net/result.php?resultid=69547499
https://malariacontrol.net/result.php?resultid=69508082
https://malariacontrol.net/result.php?resultid=69470689

Possibly you didn't increase rsc_fpops_bound enough, but you are also giving these tasks a much too low estimated runtime, so rsc_fpops_est is also far too low.

A second problem for us crunchers is that we have no idea why we are receiving a resend.
Until a WU is completed (2 valid results), looking at the workunit information only gets us: suppressed pending completion.
Very unfriendly, because we have no chance to see why a task errored out on another machine or whether there is already a wingman waiting with a valid result.

Thyme Lawn
Send message
Joined: Jun 20 06
Posts: 181
Credit: 1,233,724
RAC: 1,389

Until a WU is completed (2 valid results), looking at the workunit information only gets us: suppressed pending completion.
Very unfriendly, because we have no chance to see why a task errored out on another machine or whether there is already a wingman waiting with a valid result.

Agreed on the unfriendliness, but ...

The project is using adaptive replication to make more efficient use of user CPU time. That causes the BOINC server code to suppress a workunit's details until it has been validated. The comment in the code for the test which generates the message includes the following:

so that bad guys can't tell if they have an unreplicated job

i.e. the vast majority of users are inconvenienced because a few people will try anything to inflate their credits :(
____________
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer

michaelT
Volunteer moderator
Project administrator
Project developer
Project scientist
Send message
Joined: Jul 20 10
Posts: 47
Credit: 16,359
RAC: 0


This problem still exists also with newer tasks and is happening hundreds of times. I've seen tasks with those errors after 9 hours runtime :(

Some examples:
https://malariacontrol.net/result.php?resultid=69547499
https://malariacontrol.net/result.php?resultid=69508082
https://malariacontrol.net/result.php?resultid=69470689

Possibly you didn't increase rsc_fpops_bound enough, but you are also giving these tasks a much too low estimated runtime, so rsc_fpops_est is also far too low.


We increased the rsc_fpops_bound for some of the jobs (not all), but it seems that other jobs also exceeded rsc_fpops_bound. So after your message was posted, rsc_fpops_bound was changed for the rest of the jobs, and now everything is back to normal. :)

For the second problem, I think Thyme Lawn has answered you (@Thyme Lawn: thanks :) )

biodoc
Send message
Joined: Jan 14 08
Posts: 17
Credit: 602,969
RAC: 0

I have 2 computers (1090T & 2600K), both running 64-bit Ubuntu 11.04 Linux.

I'm getting the error described in this thread on a significant # of WUs in the last few days.

Stderr output (core client 6.10.58):

Maximum elapsed time exceeded

Here's a list on one computer:

https://malariacontrol.net/results.php?userid=17301&offset=0&show_names=0&state=5

hardy
Volunteer moderator
Project administrator
Project developer
Avatar
Send message
Joined: Feb 18 09
Posts: 141
Credit: 54,376
RAC: 129

Looks like this happens most on your Core i7-2600K CPU. Are you running this with 8 simultaneous tasks at once? That's probably the way to get the most work out of the CPU, but performance per job will be significantly lower than when only one is run, since (a) hyperthreading shares processing resources between multiple threads and (b) the CPU clocks itself higher when only 1 or 2 cores are in use. I suspect the BOINC CPU benchmark runs single-threaded, which means actual performance when processing work will be lower than expected (at least initially for each app version) and thus BOINC underestimates how long the workunits need to run. We should probably increase the cut-off limit to compensate.

biodoc
Send message
Joined: Jan 14 08
Posts: 17
Credit: 602,969
RAC: 0

Looks like this happens most on your Core i7-2600K CPU. Are you running this with 8 simultaneous tasks at once? That's probably the way to get the most work out of the CPU, but performance per job will be significantly lower than when only one is run, since (a) hyperthreading shares processing resources between multiple threads and (b) the CPU clocks itself higher when only 1 or 2 cores are in use. I suspect the BOINC CPU benchmark runs single-threaded, which means actual performance when processing work will be lower than expected (at least initially for each app version) and thus BOINC underestimates how long the workunits need to run. We should probably increase the cut-off limit to compensate.


I'm pretty sure I disabled the "Turbo Boost" mode in the BIOS, but I'll check the next time I reboot. All 8 threads are running malaria and one other DC project. I did clock back both machines a month or so ago due to the hot summer weather. I ran the BOINC CPU benchmarks yesterday just to make sure they are in line with the lower clock settings.

biodoc
Send message
Joined: Jan 14 08
Posts: 17
Credit: 602,969
RAC: 0

I checked the "top hosts" page and most of the top computers are having the same problem/error to varying degrees. Here's one, for example:

https://malariacontrol.net/results.php?hostid=190336&offset=0&show_names=0&state=5

hardy
Volunteer moderator
Project administrator
Project developer
Avatar
Send message
Joined: Feb 18 09
Posts: 141
Credit: 54,376
RAC: 129

Well, I've increased the limit (this will only affect new workunits). It seems a bit odd if you compare the host's FPOps/sec times run-time with the bound (it was 3*10^14, now 1.5*10^15), but let's see whether this increase solves the problem before doing anything else.

TylerChris
Send message
Joined: Mar 29 07
Posts: 23
Credit: 513,393
RAC: 2

Same thing here; it started about a week ago after crunching a shed load of the small CMerit WUs.
This only happens with the "B" units; they all fail now, only the CMerits still get through.
My temporary answer is to disable the "B" units on this machine for the time being.
Chris

Thyme Lawn
Send message
Joined: Jun 20 06
Posts: 181
Credit: 1,233,724
RAC: 1,389

Something has caused Chris's average processing rate for openMalariaB v6.57 to explode up to just over 310 (see here).

In the BOINC data directory you'll find the file job_log_www.malariacontrol.net.txt which records the times for every completed task. The format of the file is:


  • Unix time stamp when the task completed
  • ue followed by the estimated runtime without a DCF adjustment.
  • ct followed by the CPU time
  • fe followed by the rsc_fpops_est value sent by the server
  • nm followed by the task name
  • et followed by the elapsed run time


Unfortunately the file doesn't identify the application used, but based on the files on my systems it looks like ue for the B tasks is significantly less than for A tasks.

Excluding all of the A tasks from the file it's clear that "normal" and wuCMreint B tasks are pulling the average processing rate in opposite directions. I'll use my Q6600 to illustrate:


  • between 16th and 22nd June it ran 50 wuCMreint tasks using openMalaria test version 6.57. They all had rsc_fpops_est (fe) set to 15,000,000,000,000. The initial ue value was almost 510 seconds and it gradually reduced until the 18th task when there was a massive drop from 478.5 to 43.6 seconds. That coincided with rsc_fpops_est being reduced by a factor of 10. On the last task ue had reduced to 41 seconds which was in the right sort of range (tasks ran for between 18 and 94 seconds).

  • on 19th July it received the first of the 42 wuCMreint tasks run with openMalaria B version 6.57. It had rsc_fpops_est set to 11,000,000,000,000 and ue was 1007 seconds. The run time for these tasks has ranged from 19 to 110 seconds and ue had reduced to 804 seconds for the last one. So ue > et for wuCMreint tasks.

  • by way of contrast, "normal" B tasks mostly have ue < et, but a number of recent wu_1200_* tasks have run for over 4 hours instead of the 30 to 60 minutes required for most tasks.


The application details page for my Q6600 has the average processing rate for test and B v6.57 set to 0.1826 and 10.5963 respectively.

Based on this it would appear that the 2 types of B tasks are pulling the average processing rate in different directions and that significantly reducing rsc_fpops_est for wuCMreint tasks (by a factor of around 10) would improve things.
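
For anyone who wants to run the same comparison on their own host, here is a rough parser for that job log, based on the field order listed above (the wuCMreint filter is just an example):

# Rough parser for job_log_www.malariacontrol.net.txt, assuming each line reads
# <timestamp> ue <est> ct <cpu_time> fe <rsc_fpops_est> nm <task_name> et <elapsed>
def parse_job_log(path):
    for line in open(path):
        t = line.split()
        if len(t) < 11:
            continue
        yield {"time": int(t[0]), "ue": float(t[2]), "ct": float(t[4]),
               "fe": float(t[6]), "nm": t[8], "et": float(t[10])}

for job in parse_job_log("job_log_www.malariacontrol.net.txt"):
    if "wuCMreint" in job["nm"]:                       # example filter
        print(f"{job['nm']}: ue={job['ue']:.1f}s  et={job['et']:.1f}s  "
              f"ratio={job['et'] / job['ue']:.2f}")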
____________
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer

Profile Krunchin-Keith [USA]
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: Nov 10 05
Posts: 3217
Credit: 5,500,753
RAC: 3,644

Could this be related to the cause?

I found these per-app <flops> values in the sched_reply_www.malariacontrol.net file:

A 9965953386.957493
B 16585845220.902262

But it appears all workunits (A & B) use the same <rsc_fpops_est> value of
21000000000000.000000

I see various estimated run times.

On a 12-core:
A is 3 hours
B is 2 hours
Generally these run in 30 minutes, but some do take 2 hours.

On a 2-core:
A is 7 hours
B is 3 hours
Generally on this one they take 45 minutes, with some going 2-3 hours.

I think this difference between A and B also causes the DCF to vary greatly and causes the 'high priority' running that a lot of people report.
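
For what it's worth, dividing that shared estimate by each branch's <flops> gives the raw per-task estimate before DCF stretches it; a quick sketch with the numbers above:

# Raw per-task runtime estimate implied by the sched_reply values above
# (DCF then multiplies this further on pre-6.11 clients).
RSC_FPOPS_EST = 21000000000000.0
FLOPS = {"A": 9965953386.957493, "B": 16585845220.902262}

for branch, flops in FLOPS.items():
    est = RSC_FPOPS_EST / flops
    print(f"branch {branch}: {est:.0f} s (~{est / 60:.0f} min) before DCF")
# branch A: ~2107 s (~35 min), branch B: ~1266 s (~21 min)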

Thyme Lawn
Send message
Joined: Jun 20 06
Posts: 181
Credit: 1,233,724
RAC: 1,389

I think this difference between A and B also causes the DCF to vary greatly and causes the 'high priority' running that a lot of people report.

The average processing rates shown on a host's application version page should smooth out the differences between A and B (the more tasks the host has run the better the smoothing will be).

High priority running is mainly caused by tasks which take significantly longer than their estimated run time. They immediately inflate DCF, but it comes back down at a much slower rate (by less than 10% on each successful task completion).

Overnight, B task wu_1200_415_12108_0_1312048809_0 had a run time of 5:22:16 on my Q6600. That bumped DCF up to over 10.293981 and caused 2 A tasks to immediately start running at high priority. Tasks are now taking a lot less time than the DCF-adjusted estimates. For example, an A task has just completed with a run time of 0:54:51, causing:

  • DCF to fall from 10.106363 to 9.197092
  • the estimated run time to fall from 9:06:58 to 8:17:45 for A tasks and from 5:16:04 to 4:47:56 for B tasks
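
The behaviour in those bullet points can be reproduced with a simple rule of thumb: when a task overruns its raw (DCF-free) estimate the client jumps DCF straight up to the observed ratio, and on a normal completion it only moves DCF 10% of the way towards the ratio. A minimal Python sketch, using values back-derived from the numbers above (an approximation, not the exact BOINC client code):

# Approximate DCF update: jump to the observed ratio on an overrun, otherwise
# move 10% of the way towards it (so it falls by <10% per successful task).
def update_dcf(dcf, elapsed, raw_estimate):
    ratio = elapsed / raw_estimate
    if ratio > dcf:
        return ratio                      # immediate inflation
    return dcf + 0.1 * (ratio - dcf)      # slow decay

dcf = 10.106363                           # value before the short A task
raw_estimate = 32818 / 10.106363          # 9:06:58 estimate with the DCF taken back out
dcf = update_dcf(dcf, elapsed=3291, raw_estimate=raw_estimate)   # the 0:54:51 task
print(f"DCF after the short task: {dcf:.6f}")                    # ~9.1971, as above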


____________
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer

hardy
Volunteer moderator
Project administrator
Project developer
Avatar
Send message
Joined: Feb 18 09
Posts: 141
Credit: 54,376
RAC: 129

So bad rsc_fpops_est figures bite again. Thanks for figuring this out. Well, the good news is that these CMreint workunits are nearly over (though boy, does processing 180,000 results on the server take a long time)! The bad news is that we still don't have a better way of estimating workunit run time. It shouldn't be impossible to do better than we currently do; it's just one of those things we've not found the time for. Sorry that I can't offer any promise of getting to this; our software group is currently downsizing.

John C MacAlister
Send message
Joined: Feb 20 11
Posts: 20
Credit: 192,407
RAC: 0

We need a project to solve the disease of downsizing now running rampant all over the world. :(

Profile mikey
Avatar
Send message
Joined: Mar 23 07
Posts: 4382
Credit: 5,361,193
RAC: 1,084

We need a project to solve the disease of downsizing now running rampant all over the world. :(


So essentially we need to stop doing more with less? Or maybe we need to help companies figure out how to G-R-O-W so they not only retain the people they have but hire M-O-R-E of them!!

Warped
Avatar
Send message
Joined: Aug 1 10
Posts: 22
Credit: 207,998
RAC: 501

... maybe we need to help companies figure out how to G-R-O-W so they not only retain the people they have but hire M-O-R-E of them!!


Indeed.

The very fabric of society will be undermined if this alarming trend is not reversed or at least slowed. Malaria treatment regimes would also be of reduced benefit.

____________
Warped



Copyright © 2013 africa@home