Maximum elapsed time exceeded

Message boards : Number crunching : Maximum elapsed time exceeded

DJStarfox
Send message
Joined: Nov 26 07
Posts: 13
Credit: 144,924
RAC: 46

Stderr output (core client 6.10.45):

Maximum elapsed time exceeded
Warning: will use heterogeneity workaround.


What is going on? Since my BOINC client downloaded the newest science app yesterday, I have nothing but errors.
https://malariacontrol.net/result.php?resultid=62988924

Profile Conan
Avatar
Send message
Joined: Mar 24 09
Posts: 14
Credit: 112,965
RAC: 76

All my Windows work units from last night and today are getting this error; all were OK yesterday.

Linux work units are processing OK (they just give very low credit, under 5 per WU).

Conan
____________

Profile Conan
Avatar
Send message
Joined: Mar 24 09
Posts: 14
Credit: 112,965
RAC: 76

All Windows work units are giving this "Maximum elapsed time exceeded" error; it has been this way for 2 days now.
____________

John.Sawyer
Send message
Joined: Sep 12 09
Posts: 1
Credit: 99,792
RAC: 0

Same here

As well as getting the elapsed time message, I just had 40K credits knocked off.
See Guillaume Gnaegi's post here: https://malariacontrol.net/forum_thread.php?id=1144#14654 for an explanation of the credit problem.

I don't want to waste ticks, so I've suspended Malaria Control task pickup until this problem has been resolved.
____________

Profile Conan
Avatar
Send message
Joined: Mar 24 09
Posts: 14
Credit: 112,965
RAC: 76

Does anyone know if this issue is fixed?

I am not going to reconnect my Windows computers until I know it is all working, given the number of failed work units I have sent back over the last two days.

Conan
____________

John Clark
Avatar
Send message
Joined: Feb 10 08
Posts: 2149
Credit: 1,193,234
RAC: 1,472

I seem to be sending back a large number of computationally failed WUs just lately. In fact the majority on this host, despite the fact it was working OK until the correction of the high credit per WU issue.

What is happening?
____________
Go away, I was asleep

Said a Russell, 3 Shih-Tzus & a Bischeon Frize

Profile Conan
Avatar
Send message
Joined: Mar 24 09
Posts: 14
Credit: 112,965
RAC: 76

Now that you mention it, that is also about the time I started to get the problems.
My first few work units for the project under Windows ran fine; now they don't.
So it could be an issue with whatever the project did to remove credit, and it has affected the Windows work units.

Conan
____________

Profile Saenger
Avatar
Send message
Joined: Mar 8 06
Posts: 55
Credit: 143,384
RAC: 28

I seem to be sending back a large number of computationally failed WUs just lately. In fact the majority on this host, despite the fact it was working OK until the correction of the high credit per WU issue.

What is happening?

That seems to be the same error that made a lot of my WUs crash before and during the credit disaster. It's probably unrelated, though, as the credits are granted on the server and this happens on the distributed machines.

The issue doesn't seem to be fixed, though I have received some tasks that took more than one hour and didn't get the error.
____________
Grüße vom Sänger

Thyme Lawn
Send message
Joined: Jun 20 06
Posts: 181
Credit: 1,233,724
RAC: 1,389

The formula to work out the maximum allowed elapsed time uses the following XML tags from client_state.xml:


  • <duration_correction_factor> from the <project> section for MCDN.

  • <flops> from the openMalariaA v6.51 <app_version> section.

  • <rsc_fpops_bound> from the <workunit> section (fixed at 150000000000000.000000).
    For the estimated elapsed time <rsc_fpops_est> is used instead (now back to being fixed at 20000000000000.000000).


The value for <flops> depends on your benchmark and will be different on each system (for openMalariaA v6.51 it's 1403434822.298770 on my C2Q Q6600 XP system and 2142322887.568957 on my C2D T7300 Vista system).

The maximum allowed elapsed time is calculated using the formula

<rsc_fpops_bound> / <flops> * <duration_correction_factor>


Taking a task queued up on my Q6600 as an example, the formula for maximum elapsed time is

150000000000000 / 1403434822.298770 * 0.543255 = 58063.44 seconds (16:07:43)


and for estimated elapsed time it's

20000000000000 / 1403434822.298770 * 0.543255 = 7741.79 seconds (2:09:02)


(DCF should be smaller for openMalaria branch A because most of those tasks are currently completing in between 30 and 90 minutes, but branch B tasks are taking longer than their estimated time and pushing DCF back up.)
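
To make the arithmetic concrete, here is a minimal Python sketch of those two calculations, using the Q6600 values quoted above (the function name is just illustrative):

# Elapsed-time limits derived from client_state.xml values, as described above.
def elapsed_limit(rsc_fpops, flops, dcf):
    # seconds allowed (or estimated) = FPOPS budget / projected speed * DCF
    return rsc_fpops / flops * dcf

FLOPS_Q6600 = 1403434822.298770   # <flops> for openMalariaA v6.51 on the Q6600
DCF = 0.543255                    # <duration_correction_factor> for MCDN

print(elapsed_limit(150000000000000, FLOPS_Q6600, DCF))   # <rsc_fpops_bound>: ~58063.44 s
print(elapsed_limit(20000000000000, FLOPS_Q6600, DCF))    # <rsc_fpops_est>:   ~7741.79 s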
____________
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer

Profile OlaV_Ouafouaf
Send message
Joined: Jun 20 06
Posts: 5
Credit: 202,527
RAC: 32

All Windows work units are giving this "Maximum elapsed time exceeded" error; it has been this way for 2 days now.


I too ...
____________

Mark Woodruff
Send message
Joined: Nov 14 10
Posts: 1
Credit: 742,062
RAC: 426

Getting similar errors here too... 1417 seconds seems pretty low as the maximum elapsed time. Once a job hits 1417 seconds it aborts.

Gonna suspend because there's no point in running them if they're consistently aborting - and they are.

FYI: Windows 7 Ultimate 64-bit on an AMD Phenom 9750 with 8 GB of RAM.

I'll check in daily to see if it's been resolved.

Mark

DJStarfox
Send message
Joined: Nov 26 07
Posts: 13
Credit: 144,924
RAC: 46

Uhh... I'm running Linux x86_64, but I'm still getting this error.

BobCat13
Send message
Joined: Jan 4 07
Posts: 6
Credit: 153,366
RAC: 105

I may have found the problem.

Here are some numbers from my Athlon X2 6000+, pulling the <flops> values from the <app_version> sections in client_state.xml:

benchmark      3033420905.978824
minirosetta    3066968503.937008
simap          2840809069.112005
poem           3023016031.503862
openMalariaA  50069545112.719414

Notice how the first three applications are similar to the benchmark, but openMalariaA is almost 17 times as high. Using the calculation listed by Thyme Lawn but dropping the DCF portion, I get:

150000000000000 / 50069545112.719414 = 2995.833 sec

which matches the elapsed time when the task errored (49:55 = 2995 sec).

The new runtime estimation does not use DCF, as per this link:

http://boinc.berkeley.edu/trac/wiki/RuntimeEstimation

The question is: why does openMalariaA have such a high number in <flops>?
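
For anyone who wants to check their own host, here is a rough sketch that lists the per-app <flops> values from client_state.xml and the DCF-free limit each one implies (it assumes the file parses as plain XML and uses the 1.5e14 bound from earlier in the thread):

# List <flops> per app_version in client_state.xml and the elapsed-time limit
# each one implies for a 1.5e14 rsc_fpops_bound, ignoring DCF.
import xml.etree.ElementTree as ET

RSC_FPOPS_BOUND = 150000000000000.0

root = ET.parse("client_state.xml").getroot()
for av in root.iter("app_version"):
    app = av.findtext("app_name")
    flops_text = av.findtext("flops")
    if not app or not flops_text:
        continue
    flops = float(flops_text)
    print(f"{app:15s} flops={flops:22.6f}  limit={RSC_FPOPS_BOUND / flops:10.1f} s")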

Thyme Lawn
Send message
Joined: Jun 20 06
Posts: 181
Credit: 1,233,724
RAC: 1,389

The new runtime estimation does not use DCF

But only if you're running an alpha test version of BOINC (6.11.* or 6.12.*). Older versions still use DCF.

The question is: why does openMalariaA have such a high number in <flops>?

That confirms my suspicion that something is wrong in the host_app_version table records for the hosts which are suffering from the maximum elapsed time problem. I'm not sure what could have caused that to happen, but it can only be fixed on the server.

Users who are running an alpha test version of BOINC will have to wait until the hav records are modified, but those who are running 6.10.* or earlier could try increasing the DCF for MCDN in client_state.xml.
____________
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer

Profile OlaV_Ouafouaf
Send message
Joined: Jun 20 06
Posts: 5
Credit: 202,527
RAC: 32

As explained in BobCat13's post above (the inflated <flops> value for openMalariaA in client_state.xml), I have found a workaround:



I manually reduced the "flops" value:
- Stop BOINC.
- Edit client_state.xml AND client_state_prev.xml with a text editor and put in a much lower value (in my case I divided it by 4).
- Save and restart BOINC.

It has been working for two days now.
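
For anyone who prefers to script the same edit rather than do it by hand, here is a rough sketch under the same assumptions (BOINC stopped first, openMalariaA is the affected app, and <flops> divided by 4); it keeps a backup of each file:

# Sketch of the manual workaround above: divide the openMalariaA <flops> value
# by 4 in client_state.xml and client_state_prev.xml. Stop the BOINC client first.
import re, shutil

DIVISOR = 4.0
APP = "openMalariaA"   # assumed name of the affected app_version

def shrink(match):
    return "<flops>%.6f</flops>" % (float(match.group(1)) / DIVISOR)

def patch_block(block_match):
    block = block_match.group(0)
    if "<app_name>%s</app_name>" % APP not in block:
        return block
    return re.sub(r"<flops>([\d.eE+-]+)</flops>", shrink, block)

for path in ("client_state.xml", "client_state_prev.xml"):
    shutil.copy(path, path + ".bak")      # keep a backup before touching anything
    text = open(path).read()
    text = re.sub(r"<app_version>.*?</app_version>", patch_block, text, flags=re.S)
    open(path, "w").write(text)

Bear in mind the scheduler can refresh the value again on a later contact, so the edit may need repeating until the server-side fix lands.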

____________

P . P . L .
Avatar
Send message
Joined: Aug 27 08
Posts: 56
Credit: 500,976
RAC: 0

Hi.

I just had this happen: 2 tasks ran for 3 hrs 34 min and then errored out, a waste of time methinks.

Both had got to > 90% done. You might want to increase the time limit to ~4 hrs maybe, so they can finish and I can get the credits too :)

Wed 02 Feb 2011 17:53:13 EST malariacontrol.net Aborting task wu_899_517_252474_0_1296534263_0: exceeded elapsed time limit 12863.775025
Wed 02 Feb 2011 17:53:22 EST malariacontrol.net Computation for task wu_899_517_252474_0_1296534263_0 finished
Wed 02 Feb 2011 17:53:22 EST malariacontrol.net Output file wu_899_517_252474_0_1296534263_0_0 for task wu_899_517_252474_0_1296534263_0 absent

============================================================================

Wed 02 Feb 2011 18:06:07 EST malariacontrol.net Aborting task wu_899_516_252474_0_1296534263_0: exceeded elapsed time limit 12863.775025
Wed 02 Feb 2011 18:06:08 EST malariacontrol.net Computation for task wu_899_516_252474_0_1296534263_0 finished
Wed 02 Feb 2011 18:06:08 EST malariacontrol.net Output file wu_899_516_252474_0_1296534263_0_0 for task wu_899_516_252474_0_1296534263_0 absent
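
Incidentally, the limit printed in those messages can be turned back into the per-app speed the scheduler must have assumed, since (as worked out earlier in the thread) the limit is roughly rsc_fpops_bound divided by the <flops> value. A quick sketch, assuming the 1.5e14 bound Thyme Lawn quoted still applied to these tasks:

# Back out the projected speed behind "exceeded elapsed time limit 12863.775025":
# limit_seconds ≈ rsc_fpops_bound / flops, so flops ≈ rsc_fpops_bound / limit_seconds.
RSC_FPOPS_BOUND = 150000000000000.0   # assumption: the 1.5e14 bound quoted earlier
limit_seconds = 12863.775025          # from the log lines above

implied_flops = RSC_FPOPS_BOUND / limit_seconds
print(f"projected speed behind the limit: {implied_flops:.3e} FLOPS")   # ~1.166e10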
____________

michaelT
Volunteer moderator
Project administrator
Project developer
Project scientist
Send message
Joined: Jul 20 10
Posts: 47
Credit: 16,359
RAC: 0

Hi P.P.L.,

The problem is that the CPU time limit was exceeded (the rsc_fpops_bound BOINC parameter: a limit on FLOPS, after which the job is aborted); it was too low for these jobs.
In fact, with the results coming back from openMalariaB, the input parameters for openMalariaB have evolved in a way that a small number of jobs now take more CPU time than before.

We have now increased the rsc_fpops_bound parameter so that newly generated jobs will not reach the limit, and you can finish the jobs and get your credits :)

P . P . L .
Avatar
Send message
Joined: Aug 27 08
Posts: 56
Credit: 500,976
RAC: 0

Hi michaelT.

We have now increased the rsc_fpops_bound parameter so that newly generated jobs will not reach the limit, and you can finish the jobs and get your credits :)

Verrrry gooood. ;)

____________

BOINCstats.SOFA
Send message
Joined: Feb 2 11
Posts: 1
Credit: 3,632
RAC: 0

This problem still exists also with newer tasks and is happening hundreds of times. I've seen tasks with those errors after 9 hours runtime :(

Some examples:
https://malariacontrol.net/result.php?resultid=69547499
https://malariacontrol.net/result.php?resultid=69508082
https://malariacontrol.net/result.php?resultid=69470689

Possibly you didn't increase rsc_fpops_bound enough, but you are also giving these tasks a much too low estimated runtime, so rsc_fpops_est is also far too low.

A second problem for us crunchers is that we have no idea why we are receiving a resend.
Until a WU is completed (2 valid results), looking at the workunit information only gets us: suppressed pending completion.
Very unfriendly, because we have no chance to see why a task errored out on another machine or whether there is already a wingman waiting with a valid result.

Thyme Lawn
Send message
Joined: Jun 20 06
Posts: 181
Credit: 1,233,724
RAC: 1,389

Until a WU is completed (2 valid results), looking at the workunit information only gets us: suppressed pending completion.
Very unfriendly, because we have no chance to see why a task errored out on another machine or whether there is already a wingman waiting with a valid result.

Agreed on the unfriendliness, but ...

The project is using adaptive replication to make more efficient use of user CPU time. That causes the BOINC server code to suppress a workunit's details until it has been validated. The comment in the code for the test which generates the message includes the following:

so that bad guys can't tell if they have an unreplicated job

i.e. the vast majority of users are inconvenienced because a few people will try anything to inflate their credits :(
____________
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer

michaelT
Volunteer moderator
Project administrator
Project developer
Project scientist
Send message
Joined: Jul 20 10
Posts: 47
Credit: 16,359
RAC: 0


This problem still exists also with newer tasks and is happening hundreds of times. I've seen tasks with those errors after 9 hours runtime :(

Some examples:
https://malariacontrol.net/result.php?resultid=69547499
https://malariacontrol.net/result.php?resultid=69508082
https://malariacontrol.net/result.php?resultid=69470689

Possibly you didn't increase rsc_fpops_bound enough, but you are also giving these tasks a much too low estimated runtime, so rsc_fpops_est is also far too low.


We increased the rsc_fpops_bound for some of the jobs (not all), but it seems that other jobs also exceeded rsc_fpops_bound. So after your message was posted, rsc_fpops_bound was changed for the rest of the jobs, and now everything is back to normal. :)

For the second problem, I think Thyme Lawn has answered you (@Thyme Lawn: thanks :) )

biodoc
Send message
Joined: Jan 14 08
Posts: 17
Credit: 602,969
RAC: 0

I have 2 computers (1090T & 2600K), both running 64-bit Ubuntu 11.04 Linux.

I'm getting the error described in this thread on a significant # of WUs in the last few days.

Stderr output (core client 6.10.58):

Maximum elapsed time exceeded

Here's a list on one computer:

https://malariacontrol.net/results.php?userid=17301&offset=0&show_names=0&state=5

hardy
Volunteer moderator
Project administrator
Project developer
Avatar
Send message
Joined: Feb 18 09
Posts: 141
Credit: 54,376
RAC: 129

Looks like this happens most on your Core i7-2600K CPU. Are you running this with 8 simultaneous tasks at once? That's probably the way to get the most work out of the CPU, but performance per job will be significantly lower than when only one is run, since (a) hyperthreading shares processing resources between multiple threads and (b) the CPU clocks itself higher when only 1 or 2 cores are in use. I suspect the BOINC CPU benchmark runs single-threaded, which means actual performance when processing work will be lower than expected (at least initially for each app version) and thus BOINC underestimates how long the workunits need to run. We should probably increase the cut-off limit to compensate.

biodoc
Send message
Joined: Jan 14 08
Posts: 17
Credit: 602,969
RAC: 0

Looks like this happens most on your Core i7-2600K CPU. Are you running this with 8 simultaneous tasks at once? That's probably the way to get the most work out of the CPU, but performance per job will be significantly lower than when only one is run, since (a) hyperthreading shares processing resources between multiple threads and (b) the CPU clocks itself higher when only 1 or 2 cores are in use. I suspect the BOINC CPU benchmark runs single-threaded, which means actual performance when processing work will be lower than expected (at least initially for each app version) and thus BOINC underestimates how long the workunits need to run. We should probably increase the cut-off limit to compensate.


I'm pretty sure I disabled the "Turbo Boost" mode in the BIOS, but I'll check the next time I reboot. All 8 threads are running malaria and one other DC project. I did clock back both machines a month or so ago due to the hot summer weather. I ran the BOINC CPU benchmarks yesterday just to make sure they are in line with the lower clock settings.

biodoc
Send message
Joined: Jan 14 08
Posts: 17
Credit: 602,969
RAC: 0

I checked the "top hosts" page and most of the top computers are having the same problem/error to varying degrees. Here's one, for example:

https://malariacontrol.net/results.php?hostid=190336&offset=0&show_names=0&state=5

hardy
Volunteer moderator
Project administrator
Project developer
Avatar
Send message
Joined: Feb 18 09
Posts: 141
Credit: 54,376
RAC: 129

Well, I've increased the limit (this will only affect new workunits). It seems a bit odd if you compare the host's FPOps/sec times run-time with the bound (it was 3*10^14, now 1.5*10^15), but let's see whether this increase solves the problem before doing anything else.

TylerChris
Send message
Joined: Mar 29 07
Posts: 23
Credit: 513,393
RAC: 2

Same thing here; it started about a week ago after crunching a shed load of the small CMerit WUs.
This only happens with the "B" units; they all fail now, only the CMerits still get through.
My temporary answer is to disable the "B" units on this machine for the time being.
Chris

Thyme Lawn
Send message
Joined: Jun 20 06
Posts: 181
Credit: 1,233,724
RAC: 1,389

Something has caused Chris's average processing rate for openMalariaB v6.57 to explode up to just over 310 (see here).

In the BOINC data directory you'll find the file job_log_www.malariacontrol.net.txt which records the times for every completed task. The format of the file is:


  • Unix time stamp when the task completed
  • ue followed by the estimated runtime without a DCF adjustment.
  • ct followed by the CPU time
  • fe followed by the rsc_fpops_est value sent by the server
  • nm followed by the task name
  • et followed by the elapsed run time


Unfortunately the file doesn't identify the application used, but based on the files on my systems it looks like ue for the B tasks is significantly less than for A tasks.

Excluding all of the A tasks from the file it's clear that "normal" and wuCMreint B tasks are pulling the average processing rate in opposite directions. I'll use my Q6600 to illustrate:


  • between 16th and 22nd June it ran 50 wuCMreint tasks using openMalaria test version 6.57. They all had rsc_fpops_est (fe) set to 15,000,000,000,000. The initial ue value was almost 510 seconds and it gradually reduced until the 18th task when there was a massive drop from 478.5 to 43.6 seconds. That coincided with rsc_fpops_est being reduced by a factor of 10. On the last task ue had reduced to 41 seconds which was in the right sort of range (tasks ran for between 18 and 94 seconds).

  • on 19th July it received the first of the 42 wuCMreint tasks run with openMalaria B version 6.57. It had rsc_fpops_est set to 11,000,000,000,000 and ue was 1007 seconds. The run time for these tasks has ranged from 19 to 110 seconds and ue had reduced to 804 seconds for the last one. So ue > et for wuCMreint tasks.

  • by way of contrast, "normal" B tasks mostly have ue < et, but a number of recent wu_1200_* tasks have run for over 4 hours instead of the 30 to 60 minutes required for most tasks.


The application details page for my Q6600 has the average processing rate for test and B v6.57 set to 0.1826 and 10.5963 respectively.

Based on this it would appear that the 2 types of B tasks are pulling the average processing rate in different directions and that significantly reducing rsc_fpops_est for wuCMreint tasks (by a factor of around 10) would improve things.
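
For anyone who wants to run the same comparison on their own host, here is a rough parser for that job log, based on the field order listed above (the wuCMreint filter is just an example):

# Rough parser for job_log_www.malariacontrol.net.txt, assuming each line reads
# <timestamp> ue <est> ct <cpu_time> fe <rsc_fpops_est> nm <task_name> et <elapsed>
def parse_job_log(path):
    for line in open(path):
        t = line.split()
        if len(t) < 11:
            continue
        yield {"time": int(t[0]), "ue": float(t[2]), "ct": float(t[4]),
               "fe": float(t[6]), "nm": t[8], "et": float(t[10])}

for job in parse_job_log("job_log_www.malariacontrol.net.txt"):
    if "wuCMreint" in job["nm"]:                       # example filter
        print(f"{job['nm']}: ue={job['ue']:.1f}s  et={job['et']:.1f}s  "
              f"ratio={job['et'] / job['ue']:.2f}")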
____________
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer

Profile Krunchin-Keith [USA]
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: Nov 10 05
Posts: 3217
Credit: 5,500,753
RAC: 3,644

Could this be related to the cause?

I found these per-app <flops> values in the sched_reply_www.malariacontrol.net file:

A 9965953386.957493
B 16585845220.902262

But it appears all workunits (A & B) use the same <rsc_fpops_est> value of
21000000000000.000000

I see various estimated run times.

On a 12-core:
A is 3 hours
B is 2 hours
Generally these run in 30 minutes, but some do take 2 hours.

On a 2-core:
A is 7 hours
B is 3 hours
Generally on this one they take 45 minutes, with some going 2-3 hours.

I think this difference between A and B also causes the DCF to vary greatly and causes the 'high priority' running that a lot of people report.
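
For what it's worth, dividing that shared estimate by each branch's <flops> gives the raw per-task estimate before DCF stretches it; a quick sketch with the numbers above:

# Raw per-task runtime estimate implied by the sched_reply values above
# (DCF then multiplies this further on pre-6.11 clients).
RSC_FPOPS_EST = 21000000000000.0
FLOPS = {"A": 9965953386.957493, "B": 16585845220.902262}

for branch, flops in FLOPS.items():
    est = RSC_FPOPS_EST / flops
    print(f"branch {branch}: {est:.0f} s (~{est / 60:.0f} min) before DCF")
# branch A: ~2107 s (~35 min), branch B: ~1266 s (~21 min)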

Thyme Lawn
Send message
Joined: Jun 20 06
Posts: 181
Credit: 1,233,724
RAC: 1,389

I think this difference between A and B also causes the DCF to vary greatly and causes the 'high priority' running that a lot of people report.

The average processing rates shown on a host's application version page should smooth out the differences between A and B (the more tasks the host has run the better the smoothing will be).

High priority running is mainly caused by tasks which take significantly longer than their estimated run time. They immediately inflate DCF, but it comes back down at a much slower rate (by less than 10% on each successful task completion).

Overnight, B task wu_1200_415_12108_0_1312048809_0 had a run time of 5:22:16 on my Q6600. That bumped DCF up to over 10.293981 and caused 2 A tasks to immediately start running at high priority. Tasks are now taking a lot less time than the DCF-adjusted estimates. For example, an A task has just completed with a run time of 0:54:51, causing:

  • DCF to fall from 10.106363 to 9.197092
  • the estimated run time to fall from 9:06:58 to 8:17:45 for A tasks and from 5:16:04 to 4:47:56 for B tasks
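
The behaviour in those bullet points can be reproduced with a simple rule of thumb: when a task overruns its raw (DCF-free) estimate the client jumps DCF straight up to the observed ratio, and on a normal completion it only moves DCF 10% of the way towards the ratio. A minimal Python sketch, using values back-derived from the numbers above (an approximation, not the exact BOINC client code):

# Approximate DCF update: jump to the observed ratio on an overrun, otherwise
# move 10% of the way towards it (so it falls by <10% per successful task).
def update_dcf(dcf, elapsed, raw_estimate):
    ratio = elapsed / raw_estimate
    if ratio > dcf:
        return ratio                      # immediate inflation
    return dcf + 0.1 * (ratio - dcf)      # slow decay

dcf = 10.106363                           # value before the short A task
raw_estimate = 32818 / 10.106363          # 9:06:58 estimate with the DCF taken back out
dcf = update_dcf(dcf, elapsed=3291, raw_estimate=raw_estimate)   # the 0:54:51 task
print(f"DCF after the short task: {dcf:.6f}")                    # ~9.1971, as above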


____________
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer

hardy
Volunteer moderator
Project administrator
Project developer
Avatar
Send message
Joined: Feb 18 09
Posts: 141
Credit: 54,376
RAC: 129

So bad rsc_fpops_est figures bite again. Thanks for figuring this out. Well, the good news is that these CMreint workunits are nearly over (though boy, does processing 180,000 results on the server take a long time)! The bad news is that we still don't have a better way of estimating workunit run time. It shouldn't be impossible to do better than we currently do; it's just one of those things we've not found the time for. Sorry that I can't offer any promise of getting to this; our software group is currently downsizing.

John C MacAlister
Send message
Joined: Feb 20 11
Posts: 20
Credit: 192,407
RAC: 0

We need a project to solve the disease of downsizing now running rampant all over the world. :(

Profile mikey
Avatar
Send message
Joined: Mar 23 07
Posts: 4382
Credit: 5,361,193
RAC: 1,084

We need a project to solve the disease of downsizing now running rampant all over the world. :(


So essentially we need to stop doing more with less? Or maybe we need to help companies figure out how to G-R-O-W so they not only retain the people they have but hire M-O-R-E of them!!

Warped
Avatar
Send message
Joined: Aug 1 10
Posts: 22
Credit: 207,998
RAC: 501

... maybe we need to help companies figure out how to G-R-O-W so they not only retain the people they have but hire M-O-R-E of them!!


Indeed.

The very fabric of society will be undermined if this alarming trend is not reversed or at least slowed. Malaria treatment regimes would also be of reduced benefit.

____________
Warped



Copyright © 2013 africa@home