Maximum elapsed time exceeded
Message boards : Number crunching : Maximum elapsed time exceeded
ID: 14567
All my Windows work units from last night and today are getting this error. All was OK yesterday.

ID: 14691
All my Windows work units are giving this "Maximum elapsed time exceeded" error; it's been like this for 2 days now.

ID: 14707
Same here

ID: 14712
Does anyone know if this issue is fixed?

ID: 14737
I seem to be sending back a large number of computationally failed WUs just lately. In fact they're the majority on this host, despite the fact that it was working OK until the correction of the high-credit-per-WU issue.

ID: 14739
Now that you mention it, that's also about the time I started to get the problems as well.

ID: 14743
"I seem to be sending back a large number of computationally failed WUs just lately. In fact the majority on this host, despite the fact it was working OK until the correction of the high credit per WU issue."
That seems to be the same error that made a lot of my WUs crash before and during the credit disaster. It's probably unrelated, though, as the credits are granted on the server, and this happens on the distributed machines. The issue doesn't seem to be fixed, though I've also had some tasks that took more than one hour and didn't get the error. ____________ Greetings from Sänger

ID: 14745
The formula to work out the maximum allowed elapsed time uses the following XML tags from client_state.xml: the workunit's <rsc_fpops_bound> and <rsc_fpops_est>, the app version's <flops> and the project's <duration_correction_factor> (DCF).
Taking a task queued up on my Q6600 as an example, the formula for maximum elapsed time is
150000000000000 / 1403434822.298770 * 0.543255 = 58063.44 seconds (16:07:43)
and for estimated elapsed time it's
20000000000000 / 1403434822.298770 * 0.543255 = 7741.79 seconds (2:09:02)
(DCF should be smaller for openMalaria branch A because most of those tasks are currently completing in between 30 and 90 minutes, but branch B tasks are taking longer than their estimated time and pushing DCF back up.) ____________ "The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer
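The calculation above can be sketched as follows (a minimal sketch; the constants are the values quoted from my client_state.xml, and the function name is just for illustration):

```python
# Values quoted above from client_state.xml (Q6600 host).
RSC_FPOPS_BOUND = 150_000_000_000_000  # workunit's <rsc_fpops_bound>
RSC_FPOPS_EST = 20_000_000_000_000     # workunit's <rsc_fpops_est>
FLOPS = 1_403_434_822.298770           # app version's <flops>
DCF = 0.543255                         # <duration_correction_factor>

def elapsed_limit(fpops, flops, dcf):
    """Elapsed-time limit/estimate in seconds for a DCF-based client."""
    return fpops / flops * dcf

print(elapsed_limit(RSC_FPOPS_BOUND, FLOPS, DCF))  # maximum: ~58063 s (16:07:43)
print(elapsed_limit(RSC_FPOPS_EST, FLOPS, DCF))    # estimate: ~7742 s (2:09:02)
```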
ID: 14748
"All Windows work units giving this "Maximum elapsed time exceeded" error, has been now for 2 days." I too ... ____________

ID: 14752
Getting similar errors here too... 1417 seconds seems pretty low as the maximum elapsed time. Once the job hits 1417 seconds it aborts.

ID: 14759
Uhh... I'm running Linux x86_64, but I'm still getting this error.

ID: 14760
I may have found the problem.
150000000000000 / 50069545112.719414 = 2995.833 sec
which matches the elapsed time when the task errored (49:55 = 2995 sec). The new runtime estimation does not use DCF, as per this link: http://boinc.berkeley.edu/trac/wiki/RuntimeEstimation The question is why does openMalariaA have such a high
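As a sanity check (a sketch using only the numbers above), dividing the bound by this host's inflated flops value, with no DCF applied, reproduces the abort point almost exactly:

```python
RSC_FPOPS_BOUND = 150_000_000_000_000  # workunit's <rsc_fpops_bound>
FLOPS = 50_069_545_112.719414          # this host's (inflated) <flops> value

limit = RSC_FPOPS_BOUND / FLOPS  # no DCF in the new-style estimation
print(limit)  # ~2995.8 s, matching the 49:55 (2995 s) abort
```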
ID: 14762
"The new runtime estimation does not use DCF"
But only if you're running an alpha test version of BOINC (6.11.* or 6.12.*). Older versions still use DCF.
"The question is why does openMalariaA have such a high"
That confirms my suspicion that something is wrong in the host_app_version table records for the hosts which are suffering from the maximum elapsed time problem. I'm not sure what could have caused that to happen, but it can only be fixed on the server. Users who are running an alpha test version of BOINC will have to wait until the hav records are modified, but those who are running 6.10.* or earlier could try increasing the DCF for MCDN in client_state.xml. ____________ "The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer

ID: 14764
As explained here ("I may have found the problem"), I have found a workaround: I manually reduced the "flops" value.
- Stop BOINC
- Edit client_state.xml AND client_state_prev.xml with a text editor and put in a much lower value (in my case I divided it by 4)
- Save and restart BOINC
It has been working for two days now. ____________
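For anyone who wants to automate that edit, here is a rough sketch (my own, not anything official; it assumes the values sit in plain <flops>...</flops> elements in the state files, so stop BOINC first and keep the .bak backups it writes):

```python
import os
import re
import shutil

DIVISOR = 4.0  # the value the poster divided by; adjust to taste

def reduce_flops(path):
    """Divide every <flops> value in a BOINC state file by DIVISOR.
    Run only while the BOINC client is stopped; a .bak backup is kept."""
    shutil.copy(path, path + ".bak")
    with open(path) as f:
        text = f.read()

    def repl(match):
        # Rewrite the matched value, scaled down by DIVISOR.
        return "<flops>%f</flops>" % (float(match.group(1)) / DIVISOR)

    text = re.sub(r"<flops>([\d.eE+-]+)</flops>", repl, text)
    with open(path, "w") as f:
        f.write(text)

# Apply to both state files if they exist in the current directory.
for state_file in ("client_state.xml", "client_state_prev.xml"):
    if os.path.exists(state_file):
        reduce_flops(state_file)
```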
ID: 14800
Hi.

ID: 15231
Hi P.P.L.,

ID: 15232
Hi michaelT.

ID: 15235
This problem still exists with newer tasks too, and is happening hundreds of times. I've seen tasks hit this error after 9 hours of runtime :(

ID: 15274
"Until a WU is completed (2 valid results) we are looking at the workunit information and we only get: suppressed pending completion."
Agreed on the unfriendliness, but... The project is using adaptive replication to make more efficient use of user CPU time. That causes the BOINC server code to suppress a workunit's details until it has been validated. The comment in the code for the test which generates the message includes the following:
so that bad guys can't tell if they have an unreplicated job
i.e. the vast majority of users are inconvenienced because a few people will try anything to inflate their credits :( ____________ "The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer
ID: 15276
We increased the rsc_fpops_bound for some of the jobs (not all), but it seems that other jobs had also exceeded the rsc_fpops_bound. So after your message was posted, the rsc_fpops_bound was changed for the rest of the jobs and now everything is back to normal. :) For the second problem, I think Thyme Lawn answered you (@Thyme Lawn: thanks :))

ID: 15355
I have 2 computers (1090T & 2600K), both with 64-bit Ubuntu 11.04 Linux as the OS.

ID: 16246
Looks like this happens most on your Core i7-2600K CPU. Are you running this with 8 simultaneous tasks at once? That's probably the way to get the most work out of the CPU, but performance per job will be significantly lower than when only one is run, since (a) hyperthreading shares processing resources between multiple threads and (b) the CPU clocks itself higher when only 1 or 2 cores are in use. I suspect the BOINC CPU benchmark runs single-threaded, which means actual performance processing work will be lower than expected (at least initially for each app version) and thus BOINC underestimates how long the workunits need to run. We should probably increase the cut-off limit to compensate.

ID: 16247
"Looks like this happens most on your Core i7-2600K CPU. Are you running this with 8 simultaneous tasks at once? That's probably the way to get the most work out of the CPU, but performance per job will be significantly lower than when only one is run since (a) hyperthreading shares processing resources between multiple threads and (b) the CPU clocks itself higher when only 1 or 2 cores are in use."
I'm pretty sure I disabled the "Turbo Boost" mode in the BIOS, but I'll check the next time I reboot. All 8 threads are running malaria and one other DC project. I did clock back both machines a month or so ago due to the hot summer weather. I ran the BOINC CPU benchmarks yesterday just to make sure they're in line with the lower clock settings.

ID: 16249
I checked the "top hosts" page and most of the top computers are having the same problem/error to varying degrees. Here's one for example.

ID: 16250
Well, I've increased the limit (this will only affect new workunits). It seems a bit odd if you compare the host's FPOps/sec times run-time with the bound (it was 3*10^14, now 1.5*10^15), but let's see whether this increase solves the problem before doing anything else.
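As a rough sanity check (a sketch reusing the inflated per-host flops value quoted earlier in this thread; actual hosts will differ), the new bound raises the cut-off on such a host from under 2 hours to over 8:

```python
FLOPS = 50_069_545_112.719414  # inflated per-host rate quoted earlier in the thread

old_limit = 3e14 / FLOPS    # old rsc_fpops_bound (3*10^14)
new_limit = 1.5e15 / FLOPS  # new rsc_fpops_bound (1.5*10^15)

print(old_limit / 3600)  # ~1.66 hours
print(new_limit / 3600)  # ~8.3 hours
```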
ID: 16252
Same thing here, started about a week ago after crunching a shedload of the small CMerit WUs.

ID: 16259
Something has caused Chris's average processing rate for openMalariaB v6.57 to explode up to just over 310 (see here).
ID: 16276
Can this be related to the cause

ID: 16280
"I think this difference in A and B causes the DCF to vary greatly too and causes the 'high priority' running a lot of people report."
The average processing rates shown on a host's application version page should smooth out the differences between A and B (the more tasks the host has run, the better the smoothing). High priority running is mainly caused by tasks which take significantly longer than the estimated run time. They immediately inflate DCF, but it reduces at a much slower rate (by less than 10% on each successful task completion). Overnight, B task wu_1200_415_12108_0_1312048809_0 had a run time of 5:22:16 on my Q6600. That bumped DCF up to 10.293981 and caused 2 A tasks to immediately start running high priority. Tasks are now taking a lot less time than the DCF-adjusted estimates. For example, an A task has just completed with a run time of 0:54:51, causing:
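That asymmetric behaviour can be modelled roughly like this (a simplified sketch of the DCF update, not the exact BOINC client code):

```python
def update_dcf(dcf, actual, estimated):
    """Simplified model: DCF jumps straight up when a task overruns its
    estimate, but decays by only ~10% of the gap when a task underruns."""
    ratio = actual / estimated
    if ratio > dcf:
        return ratio                   # immediate inflation
    return dcf + 0.1 * (ratio - dcf)   # slow decay

dcf = 1.0
dcf = update_dcf(dcf, actual=3.0, estimated=1.0)  # long B task: jumps to 3.0
dcf = update_dcf(dcf, actual=0.5, estimated=1.0)  # short A task: only drops to 2.75
print(dcf)
```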
ID: 16282
So bad rsc_fpops_est figures bite again. Thanks for figuring this out. Well, the good news is that these CMerit workunits are nearly over (though boy, does processing 180000 results on the server take a long time)! The bad news is that we still don't have a better way of estimating workunit run-time. It shouldn't be impossible to do better than we currently do; it's just one of those things we've not found the time for. Sorry that I can't offer any promises of getting to this; our software group is currently downsizing.

ID: 16287
We need a project to solve the disease of downsizing now running rampant all over the world. :(

ID: 16288
"We need a project to solve the disease of downsizing now running rampant all over the world. :("
So essentially we need to stop doing more with less? Or maybe we need to help companies figure out how to G-R-O-W so they not only retain the people they have but hire M-O-R-E of them!!

ID: 16323
"... maybe we need to help companies figure out how to G-R-O-W so they not only retain the people they have but hire M-O-R-E of them!!"
Indeed. The very fabric of society will be undermined if this alarming trend is not reversed or at least slowed. Malaria treatment regimes would also be of reduced benefit. ____________ Warped

ID: 16529