Long Run Times

Profile The Gas Giant
Joined: Mar 7 06
Posts: 1214
Credit: 3,625,404
RAC: 2,632

I've currently got some WUs with very long run times. The estimated time to completion for ready-to-start WUs is 12 hrs (they typically take 1 to 3 hrs), with one running WU at 55% after 6.75 hrs and one at 15% after 5.75 hrs (Task Manager shows each WU using 50% of the CPU on this dual-core machine).

Is this an intentional change to the WUs?

Bill Hepburn
Joined: Jul 14 06
Posts: 2
Credit: 2,178,969
RAC: 4,449

I got a couple of those too. I didn't notice them until they had run for over a day to reach 25% completion and went into high-priority mode. I have also had about a dozen WUs error out in the past couple of days. Neither of these has happened before with any regularity. Something is going on with the WU generation. I have set "no new work" for a couple of days, or until the project announces it's fixed.
____________

Rutor
Joined: Jan 28 11
Posts: 1
Credit: 100,795
RAC: 0

And after 33 hours of computing without error, you are rewarded with 0.00 credits.

wuid=69324916

Please fix this issue.

Regards,
Rutor

Neil
Joined: Dec 30 09
Posts: 20
Credit: 1,815,035
RAC: 1,484


It's hard to know how much human intervention and oversight is involved at Malariacontrol Central. Maybe things are left to run automatically, and it could be weeks before anyone even starts troubleshooting the problem.

I hope it doesn't unfold like that.

It's not hopeful when the latest "News" on the Malariacontrol.net home page is from 20 Dec 2010. Communications is key. It will be interesting to get some.

At the least, I would rather the work units dry up, let my processors have a vacation, and stop wasting my electricity rather than have faulty work units running as normal and then get thrown out.

But I don't think I can bring myself to turn off my computers, myself. The project probably needs our computers to provide feedback to see if things are back to normal, yet, maybe now, maybe now.

Best luck,
Neil / Scranton, PA, where it's gray and dreary even on the sunniest of days

Pavel K
Joined: Mar 25 12
Posts: 1
Credit: 1,384,942
RAC: 1

You can find some updates on the message board.
I aborted the 2952 WUs because they took a long time (>40 hr), and some of them finished with a computation error.
There is still no response about why that happened, so I will abort 3153 and 2952 until an adequate response appears.
____________

Profile p3d-cluster
Joined: Aug 1 08
Posts: 1
Credit: 3,608,274
RAC: 22,641

Very long run time and no credits?

https://malariacontrol.net/result.php?resultid=126005469

That's not OK.


jdvb
Joined: Apr 4 11
Posts: 19
Credit: 3,950,827
RAC: 6,298

You can add 2953 to the list of absurd running times.
I have some that run over 50 hours on my 2600K.
I've seen run times of over 100 hours on slower machines, where the deadline had obviously passed.

I had noticed a giant drop in points per day from my machines due to these.

No points for such WUs is not OK, so I'll be giving my CPUs a vacation on a different project. I'll be back when it's sorted. I'm not going to babysit machines to keep them from starting such tasks.

I am even finding WUs that were turned in well before the deadline and still return 0 credits for 30+ hours of work.
(Someone from my team suggested that a missed deadline might be why no points were granted, so that is not the only issue. Expiration obviously is an issue with such run times, though that's probably due to faulty WUs rather than an intentional change.)

Anyway,
we'll see if someone from malaria is reading this,
or will read it when the drop in donated computation time is noticed.
I can't see why anyone would stay when many tasks are impossible to turn in on time for most CPUs.

This has been going on for more than 2 days now; how slow will the first response be?

jdvb
Joined: Apr 4 11
Posts: 19
Credit: 3,950,827
RAC: 6,298

Small update:
not all long-running units are returned without credits.

The question is why that one got credits while so many others did not.

michaelT
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: Jul 20 10
Posts: 47
Credit: 16,359
RAC: 0

Regarding the zero-credit problem:

It seems this is due to an issue with the validator: there is a MAX_GRANTED_CREDIT parameter which, to prevent cheating through inflated credit requests, should in theory grant MAX_GRANTED_CREDIT when WU_CREDIT > MAX_GRANTED_CREDIT, but in our case it granted 0 credit ... :(

Some of the workunits that were granted 0 credit have already been purged, but we managed to recover all the hosts and the average credits for them. So for those who didn't get credit before, it's fixed now.

We increased MAX_GRANTED_CREDIT so that this should not be a problem anymore. But let me know if it happens again.
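
For readers trying to picture the difference between the intended cap and what actually happened, here is a minimal sketch. The function names and the numbers are hypothetical and are not taken from the project's validator; only the MAX_GRANTED_CREDIT idea comes from the post above.

```python
def grant_credit_intended(claimed: float, max_granted: float) -> float:
    """Intended behaviour as described above: cap the claim at the maximum."""
    if claimed < 0 or max_granted < 0:
        raise ValueError("credit values must be non-negative")
    return min(claimed, max_granted)


def grant_credit_observed(claimed: float, max_granted: float) -> float:
    """Observed behaviour as described above: an over-limit claim ends up with 0.

    This only mimics the symptom; the real cause inside the validator
    is not described in the thread.
    """
    if claimed < 0 or max_granted < 0:
        raise ValueError("credit values must be non-negative")
    return claimed if claimed <= max_granted else 0.0


if __name__ == "__main__":
    # Made-up numbers: a very long task claims more than the configured cap.
    print(grant_credit_intended(1500.0, 1000.0))  # capped claim
    print(grant_credit_observed(1500.0, 1000.0))  # symptom reported in this thread
```

With a cap like this, raising MAX_GRANTED_CREDIT simply moves the point at which very long tasks start hitting the limit.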
____________
Michael Tarantino
Swiss Tropical and Public Health Institute
http://www.swisstph.ch

jdvb
Joined: Apr 4 11
Posts: 19
Credit: 3,950,827
RAC: 6,298

Thank you for your reply!

But what about the extreme run times of 100+ hours?
WUs have to be returned within 3 days, which is not possible when they run that long.
If the deadline is also extended to match the length of the WUs, then there is no problem.

With WUs of 5 hours at most (on a not-so-fast PC) the deadline was 3 days; now I am getting 100+ hours, which would be over 20 times longer.
Will the deadline be set accordingly? 60 days would be enormous,
then again, some WUs would run 5 days instead of 5 hours.
So 10 days would be the minimum if there were only 1 WU in the queue.

Sometimes I get 50 in the queue on a machine that is set to fetch work for an additional 0.2 days.
If those tasks then take 60 hours each on 8 threads,
that would take 16 days to complete.
So a minimum safe time would appear to be around 20 days.
What happens when not running 24/7 is then still the question.

Currently I have stopped running malariacontrol because many of my PCs cannot return tasks in time (even without a queue).
I would like to restart as soon as I can tell that my PCs are able to return them well before the deadline.
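
The queue arithmetic in this post can be written down as a small helper. This is only a rough sketch: the function name and the hours_per_day parameter are made up for illustration, and it assumes all tasks are equally long and spread evenly over the threads, so it gives an optimistic lower bound.

```python
def days_to_clear_queue(tasks: int, hours_per_task: float,
                        threads: int, hours_per_day: float = 24.0) -> float:
    """Wall-clock days needed to finish a queue of equally long tasks.

    Assumes perfect load balancing across threads; hours_per_day is how
    long the machine actually crunches each day (24 means running 24/7).
    """
    if tasks < 1 or threads < 1:
        raise ValueError("tasks and threads must be at least 1")
    if hours_per_task <= 0 or not 0 < hours_per_day <= 24:
        raise ValueError("hours must be positive and hours_per_day at most 24")
    total_cpu_hours = tasks * hours_per_task
    return total_cpu_hours / threads / hours_per_day


if __name__ == "__main__":
    # 50 queued tasks of 60 hours each on 8 threads, running 24/7.
    print(days_to_clear_queue(50, 60, 8))
    # One 120-hour task on a PC that only runs 3 hours a day.
    print(days_to_clear_queue(1, 120, 1, hours_per_day=3))
```

The first call gives a bit under 16 days, which is where the "around 20 days" safe deadline comes from; the second shows how quickly things get out of hand on a machine that is not on 24/7.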

jdvb
Joined: Apr 4 11
Posts: 19
Credit: 3,950,827
RAC: 6,298

Just to link a task that had run for 106 hours and was at 90% when I aborted it:
(it still shows computing time)

That task was obviously overdue. Good luck for the next person that tries to crunch it.

Trotador
Joined: Aug 5 09
Posts: 1
Credit: 769,743
RAC: 3,626

This one shows credit 0 as well. Many others are being valued.

https://malariacontrol.net/result.php?resultid=126341532

Neil
Joined: Dec 30 09
Posts: 20
Credit: 1,815,035
RAC: 1,484

Hi JDVB,

Thank you for doing such a good job of completely summarizing the problems that our computers are running into. Your summary will help assure that MichaelT won't miss any issues as he's working on straightening out the new work units.

I also saw super-long work units over here that were never going to finish on time. When I aborted them, they were replaced with new work units that were at least 5 times longer than normal. That's when I purged all my Boincs.

I'll be checking these forums every day for word when it's OK to jump back in.

It's paradoxical: I think it was just May 1 when my Recent Average Credit hit its highest point ever at 4000. There were some wild swings in RAC before and after that were making me wonder. I think the troublesome work units started coming out well before May 22, after which they completely took over.

Best luck,
-neil-

jdvb
Joined: Apr 4 11
Posts: 19
Credit: 3,950,827
RAC: 6,298

This one shows credit 0 as well. Many others are being valued.
https://malariacontrol.net/result.php?resultid=126341532
New tasks will be given credit; the remaining point is that there are tasks out there lasting much, much longer than the deadline to turn them in.
That basically puts the system requirements for returning tasks before the deadline so high that only recent high-end CPUs can join.
See the 100+ hour WU I posted earlier, where the hard deadline is 72 hours.

Hi JDVB,

Thank you for doing such a good job of completely summarizing the problems that our computers are running into. Your summary will help assure that MichaelT won't miss any issues as he's working on straightening out the new work units.

I just hope they will stand out among posts where people only complain about the part that is already solved, that is, "I want my credits" instead of wanting to be able to turn in all WUs before the deadline hits.

Currently I only run malaria on one PC, an i7, as that one is still able to process those WUs within 72 hours.

Heaven forbid you'd have a PC not running 24/7 but instead 3 hours a day. With tasks running 120 hours that'd be 40 days for 1 task! Oh wait, I have one such PC!

jdvb
Joined: Apr 4 11
Posts: 19
Credit: 3,950,827
RAC: 6,298

Whee, now I got a nice WU on an i7 2600K.
But sadly that CPU is not fast enough to report tasks before the deadline.

I am talking about this task.

The time between the date sent and the deadline is 83 hours and 20 minutes.
The current CPU time is 17:30 with 21.185% completed.

So 17.5/21.185*100 = 83 hours.
That is getting awfully close to that limit!

What are the system requirements to run malariacontrol?
And I guess that it's mandatory that PCs run 24/7?

Meh, and I was thinking a 2600K was fast! You guys obviously think otherwise!

Profile mikey
Joined: Mar 23 07
Posts: 4382
Credit: 5,361,193
RAC: 1,084

You can probably abort those units as Malaria is now only sending out the older type units again. My laptop, a T2300 running at 1.67GHZ and only using 1 core to crunch with, is now returning units in about 1.0 to 1.5 hours.

And NO it is not mandatory to run your pc 24/7, you WILL get more work done if you do, but as long as you get units back by their deadline it is okay.

ps I am ONLY running the A type units.

jdvb
Joined: Apr 4 11
Posts: 19
Credit: 3,950,827
RAC: 6,298

You can probably abort those units as Malaria is now only sending out the older type units again. My laptop, a T2300 running at 1.67GHZ and only using 1 core to crunch with, is now returning units in about 1.0 to 1.5 hours.
I had purged all work and then thought, well, it's a fast PC, so it will most likely still get all jobs done before the deadline passes, so let's start that one up again.
Then it went and got a stunning 61 new tasks, of which 4 run longer than 2 hours. One of those 4 is the one I mentioned; the others will most probably be in time for the deadline.
So at the time I fetched work, the odds of getting such a WU were 4 in 61. The other 4 threads managed to eat the remaining 57 WUs just fine, so they will be getting new work shortly.

So I am not certain that no more of these ridiculous jobs are coming. Besides, I think it's a waste of CPU time to abort when I could perhaps still turn them in before the deadline, so I will only abort when it's obvious that a task can't complete in time.
That said, the job I mentioned is slowing down: after 18.5 hours it is at only 21.9%, which projects to 84.5 hours to complete. That is longer than the allowed time, and likely to grow beyond 100 hours as the task is now doing less than 1% per hour.
(100-21.9)/(21.9-21.185)+18.5=127.7 hours based on the progress of the last hour.
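
For anyone who wants to run the same two estimates on their own tasks, here is a minimal sketch. The function names are made up; the first assumes progress stays linear over the whole run, the second assumes the task keeps going at the rate seen over the most recent interval.

```python
def linear_projection(elapsed_hours: float, percent_done: float) -> float:
    """Total run time if the average rate so far holds for the whole task."""
    if elapsed_hours < 0 or not 0 < percent_done <= 100:
        raise ValueError("need non-negative time and 0 < percent_done <= 100")
    return elapsed_hours / percent_done * 100


def recent_rate_projection(elapsed_hours: float, percent_done: float,
                           previous_percent: float,
                           interval_hours: float = 1.0) -> float:
    """Total run time if the rate of the last interval holds from here on."""
    if interval_hours <= 0 or percent_done <= previous_percent:
        raise ValueError("need measurable progress over a positive interval")
    rate = (percent_done - previous_percent) / interval_hours  # percent per hour
    return elapsed_hours + (100 - percent_done) / rate


if __name__ == "__main__":
    print(round(linear_projection(17.5, 21.185), 1))
    print(round(linear_projection(18.5, 21.9), 1))
    print(round(recent_rate_projection(18.5, 21.9, 21.185), 1))
```

When a task is slowing down, the second estimate is the more pessimistic and usually the more realistic of the two.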

It's just such a shame, 18.5 hours of CPU time down the drain.
If I happen to get more like that, then I'll just stop malariacontrol altogether, as I am not planning to computer-sit to check whether mc would otherwise be wasting multiple days on tasks that can never be finished before the deadline.

update:
tasks:
126501179
126502016
can't be turned in before the deadline either (progress at around 10% after 10 hours). These tasks I got around midnight. I doubt it's fixed. Waiting for the next.

126466406 is the question, it's now at 40.7% but not going so fast.

michaelT
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: Jul 20 10
Posts: 47
Credit: 16,359
RAC: 0

Generation of the long workunits has been disabled, and the deadline is now set to 6 days until the long workunit batch is over. We're still investigating the problem. It's possible that a few of them are still in the pipeline, but the situation is now almost back to normal.

In our previous experience, the same type of workunit took 10 to 20 times less CPU time, which is why the deadline was set to 3 days. It appears that we've hit a really bad region of the parameter space, but this is difficult to predict due to the complexity of the mathematical model implemented and its stochasticity.

blckgrffn
Joined: Nov 14 11
Posts: 1
Credit: 1,001,305
RAC: 0

It would be nice if this were posted on the front page. These forums are so infrequently updated that I didn't even think to check here for a while. I've moved all but one rig to other projects in the meantime.

jdvb
Joined: Apr 4 11
Posts: 19
Credit: 3,950,827
RAC: 6,298

Thank you!

I had indeed received another 4 of these long-running tasks, sadly still with the 3-day limit.
I have aborted those but would love to get them again with the 6-day limit to see the points they'll generate!
Once those few are out of the pipe I'll move back my PCs that don't run 24/7.

I guess that once the deadline is set back to 3 days there will no longer be long WUs, so that is when I'll go full on malariacontrol again.

Robert Johnson
Joined: Mar 28 08
Posts: 3
Credit: 831,139
RAC: 490

Turns out I am running two of those work units-
69243256 - 54 hours so far 69% done 22 hours left
69329138 - 54 hours so far 66% done 26 hours left

cheers,

Robert Johnson
Joined: Mar 28 08
Posts: 3
Credit: 831,139
RAC: 490

more info -
also had one a while back 69339921

Paul
Joined: Aug 17 10
Posts: 7
Credit: 481,674
RAC: 0

The long times aren't consistent either - witness this workunit which had three runs, two of which were abnormally long and one of which was two hours for the same workunit. It wouldn't surprise me to see that the Linux client was faster than the Windows one, but eight times faster?

Neil
Joined: Dec 30 09
Posts: 20
Credit: 1,815,035
RAC: 1,484


Dear Malaria bugs,

On the Malariacontrol website, I checked my list of "Tasks" which goes back about three days.

All my work units have been between 1000 and 20000 seconds (between 16 minutes and 6 hours).

For the 20000 second work unit, I got Credit = "64."

I assume that's 64 seconds of credit?

For the shorter work units, I got less credit, I assume proportionally, also in the tens-of-seconds.

If I'm only getting 1/3-of-a-percent of the credit that I'm crunching, then I'm going to keep running only 1 out of 7 processors until things are fixed.

If someone thinks I'm misinterpreting the Credits column, please let me know, thanks.

Even jdvb ("Have i7 Will Travel") mentions work unit https://malariacontrol.net/workunit.php?wuid=69400915
It shows thousands of seconds of work, but only tens of seconds credited.

Unless I'm mistaken about getting a minute's worth of credit for 20,000 seconds of work, then I think we have another big problem, in addition to getting work units that expire too quickly.

Don't crunch too hard,
-neil-

Neil
Joined: Dec 30 09
Posts: 20
Credit: 1,815,035
RAC: 1,484



Oh, and my list of "Pending Tasks," which is usually empty, is filled to the brim.

Profile mikey
Joined: Mar 23 07
Posts: 4382
Credit: 5,361,193
RAC: 1,084


We get "cobblestones" not "seconds of credit" and the two are NOT related! You get credit based on how HARD your pc works, not on how long it takes, the faster pc's can do more work in the same amount of time so get more credits per hour/day/month/etc. While working 'harder' can equate to working 'longer', it is not an automatic relationship. Read this thread for an idea of what we are doing in each unit:
https://malariacontrol.net/forum_thread.php?id=1264
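
To make the "work done, not time spent" point concrete, here is a rough sketch. It assumes the classic BOINC definition of a cobblestone (200 credits for one day of work on a 1 GFLOPS reference machine); whether the validator here uses exactly this formula is not stated in the thread, and the host speeds below are made-up numbers.

```python
SECONDS_PER_DAY = 86_400
COBBLESTONES_PER_GFLOPS_DAY = 200  # assumed classic BOINC definition


def credit_for_work(host_gflops: float, cpu_seconds: float) -> float:
    """Credit for the floating-point work done, independent of how long it took."""
    if host_gflops <= 0 or cpu_seconds < 0:
        raise ValueError("need a positive speed and non-negative time")
    gflops_days = host_gflops * cpu_seconds / SECONDS_PER_DAY
    return gflops_days * COBBLESTONES_PER_GFLOPS_DAY


if __name__ == "__main__":
    # Hypothetical hosts doing the same amount of work at different speeds.
    slow = credit_for_work(host_gflops=1.0, cpu_seconds=20_000)
    fast = credit_for_work(host_gflops=4.0, cpu_seconds=5_000)
    print(round(slow, 2), round(fast, 2))  # same work, same credit
```

Under that assumption a slow host and a fast host get the same credit for the same unit; the fast host just earns it in a quarter of the time, which is why credits per day differ between machines.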

Strat
Joined: Apr 8 12
Posts: 6
Credit: 10,841
RAC: 0

So I have another work unit taking ages: 57 hrs elapsed and 52+ hrs remaining, with that clock still extending outward. This one and another half its size were among a series that was only taking hours to complete. What's going on? Is it going to be worthless?

So I just checked my history: 7,000 s computing and zero credit, plus 250,000 s computing and zero credit. If you like I will just kill the running job at 51 hrs and you know where you can shove your malaria net!

Neil
Joined: Dec 30 09
Posts: 20
Credit: 1,815,035
RAC: 1,484

So the units for the Credit column is in Cobblestones, not Seconds? (I recommend data units should be labeled.) OK, that's half an answer to my question.

I could use a little more confirmation regarding having achieved 16 Cobblestones of Credit for 20000 seconds of CPU Time, and all the rest of my successfully completed work units are of similar ratio. Does that sound approximately like what you'd expect from a Core-2? If so, I'll assume we're back to normal and I'll turn on the rest of my processing threads.

Or is the credit off by a few magnitudes from what's expected?

Thanks,
-neil-

Profile mikey
Joined: Mar 23 07
Posts: 4382
Credit: 5,361,193
RAC: 1,084

I JUST checked and this is the latest one of yours to be validated:
Status: Completed and validated
Run time (sec): 4,993.80
CPU time (sec): 4,868.27
Credit: 17.69

And yes, that is pretty typical of how it works here at Malaria; the credits are pretty low compared to some other projects. As for 20,000 seconds for one unit and you getting 16 credits, I don't see that unit any more; it could have been purged from view already. Most of your units and mine finish in the 2 to 5 thousand second range; the unit above is 4,993.80 seconds, or around 83 minutes.

Strat
Joined: Apr 8 12
Posts: 6
Credit: 10,841
RAC: 0

So I guess these are the last jobs I do for Malariacontrol.net, seeing how I got ZERO credit.


Name wu_2952_24_741957_0_1337959442_2
Workunit 69267532
Created 30 May 2012 23:24:01 UTC
Sent 30 May 2012 23:26:25 UTC
Received 2 Jun 2012 21:45:32 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -177 (0xffffff4f)
Computer ID 554761
Report deadline 3 Jun 2012 10:46:25 UTC
Run time 240,684.84
CPU time 226,948.60
Validate state Invalid
Credit 0.00
Application version openMalaria: A simulator of malaria epidemology and control (Branch A) v6.58


Name wu_2953_232_743346_0_1338056246_2
Workunit 69362442
Created 30 May 2012 5:54:35 UTC
Sent 30 May 2012 6:01:51 UTC
Received 2 Jun 2012 10:03:54 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -177 (0xffffff4f)
Computer ID 554761
Report deadline 2 Jun 2012 17:21:51 UTC
Run time 249,103.84
CPU time 230,974.40
Validate state Invalid
Credit 0.00
Application version openMalaria: A simulator of malaria epidemology and control (Branch A) v6.58

hardy
Volunteer moderator
Project administrator
Project developer
Joined: Feb 18 09
Posts: 141
Credit: 54,376
RAC: 129

Strat, you might want to read michaelT's posts above, with regards to the credits. I think he's fixing things in the DB (i.e. for past work units). The accreditation is a bit too complicated not to have something go wrong once in a while, but we try to fix things up.

As to your question Neil, a lot of the work unit generation and processing is indeed automatic. But starting new runs like the ones which turned out problematic is not, and we definitely keep an eye on them!

Strat
Joined: Apr 8 12
Posts: 6
Credit: 10,841
RAC: 0

Well I will wait and see if I get credit but until then the tasks are emptied and malariacontrol.net is suspended on my system. If you can understand my position, I have done tasks for climateprediction net that run for months then error, they have no problem with giving credit. These tasks for malariacontrol net were extremely under quoted in their size and on very short notice running at high priority when I actually wanted all the processors running for my team on edges@home from 00 UTC 1st June. So I let them run their course even though they were interfering with what I wanted to do. Then I get 0 credit! So I say I am justifiably pissed! I chose malariacontrol net because it seemed to be doing something important and practical. I don't question what the task is or why it’s urgent, get upset if it errors or question the amount of credit, but when you get ZERO you wonder if it’s just a waste of time.

Profile Ananas
Joined: Mar 7 06
Posts: 58
Credit: 752,054
RAC: 408

The error on your box has been "Maximum elapsed time exceeded", which is exactly what I have been afraid of when I started that other thread : I just hope that no rsc_fpops_bound setting will kill it.

It is a workunit setup problem, not a program failure on the hosts.
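
A rough sketch of why a workunit setting can kill a task that is otherwise running fine. As far as I understand the BOINC client, it gives up on a task once its elapsed time exceeds rsc_fpops_bound divided by the speed it assumes for the host; the function names and all numbers below are made up for illustration and are not the values used for these workunits.

```python
def max_elapsed_seconds(rsc_fpops_bound: float, host_flops: float) -> float:
    """Elapsed-time limit implied by the workunit's fpops bound (assumed rule)."""
    if rsc_fpops_bound <= 0 or host_flops <= 0:
        raise ValueError("bound and host speed must be positive")
    return rsc_fpops_bound / host_flops


def would_time_out(elapsed_seconds: float, rsc_fpops_bound: float,
                   host_flops: float) -> bool:
    """True if the client would abort with 'Maximum elapsed time exceeded'."""
    return elapsed_seconds > max_elapsed_seconds(rsc_fpops_bound, host_flops)


if __name__ == "__main__":
    bound = 4.8e14   # hypothetical rsc_fpops_bound
    speed = 2.0e9    # hypothetical host speed in FLOPS
    print(max_elapsed_seconds(bound, speed))      # limit in seconds
    print(would_time_out(5_000, bound, speed))    # a normal-length task
    print(would_time_out(250_000, bound, speed))  # one of the very long tasks
```

So if the bound was sized for tasks of a few hours, a task that legitimately needs several days gets aborted no matter how healthy the host is.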

jdvb
Joined: Apr 4 11
Posts: 19
Credit: 3,950,827
RAC: 6,298

There is still a problem with the deadlines.
The WUs that run so long still have 3-day deadlines, while the ones that run short got 6-day deadlines.
I have another six of those, so once again CPU time is wasted because they can't be turned in before the deadline.
40+ hours computed with 50% or less completed. I can't turn those in within 72 hours, so I will once again abort.

It's about these WU's:
https://malariacontrol.net/result.php?resultid=126908865
https://malariacontrol.net/result.php?resultid=126763475
https://malariacontrol.net/result.php?resultid=126755867
https://malariacontrol.net/result.php?resultid=126705067
https://malariacontrol.net/result.php?resultid=126685196
https://malariacontrol.net/result.php?resultid=126683054

Please just remove all of those from the queue; it's getting tiresome to keep running into such WUs. It's once again wasting hundreds of hours of CPU time better spent elsewhere.

Also, please keep this thread on topic: it's about the length of WUs, not about credits. And sadly the length of the mentioned WUs does not fit in the deadline.

Thyme Lawn
Joined: Jun 20 06
Posts: 181
Credit: 1,233,724
RAC: 1,389

The long running tasks caused my Q6600 to run MCDN to the exclusion of everything else for 3 days. Not a problem in the long term as BOINC will allow the other projects to make up their shortfall.

Initially it looked like everything would be returned on time as the wu_2952_* task which triggered it took 13 hours and there were only 16 of those tasks and 15 normal length ones to complete within 4 days. After a couple of tasks took 16 hours the available time started looking a bit tight.

Then I spotted that 2 tasks were running a lot slower (one at 1% per hour, the other significantly worse than that). I manually suspended them to give the other tasks a chance to meet their deadlines. I also manually increased to make sure none of the tasks would be timed out.

Some of the remaining tasks were long duration (5 at between 14 and 16 hours and 1 at 54 hours). My pre-emptive action meant that, other than the 2 I'd suspended, only one of the 14-hour tasks missed its deadline (by 2.5 hours, and the WU obviously had a quorum of 1 as it was immediately validated).

So what of the 2 suspended tasks?


  • wu_2952_29_745368_0_1338233405_0 was returned yesterday with slightly over 4 days of run time, a bit over 2 days after its deadline. That task must have a quorum > 1 as its status is "Completed, waiting for validation". I doubt if it will be validated (it's very unlikely the WU will have a second completed task).

  • The other task (wu_2952_35_745368_0_1338233405_0) is still running in high priority mode (currently at 77% after 134 hours run time with BOINC Manager projecting 40 hours to completion).


____________
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer

Thyme Lawn
Joined: Jun 20 06
Posts: 181
Credit: 1,233,724
RAC: 1,389

And sadly the length of mentioned WU's does not fit in the deadline.

Running beyond the deadline is only a problem in 2 cases:

  1. if the task hits the limit it will be timed out by your BOINC client (exit status -177). That limit can be worked around by stopping BOINC, manually increasing the value for the affected workunits in the client_state.xml file and restarting BOINC.

  2. if other tasks from the same workunit are completed and validated a task returned after its deadline won't receive any credit.
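
The workaround in point 1 could be scripted along these lines. This is only a sketch: it assumes the limit in question is the rsc_fpops_bound setting mentioned earlier in the thread, it works on a tiny made-up sample rather than a real client_state.xml, and the workunit names and values in the sample are hypothetical. Stop BOINC and keep a backup of the file before trying anything like this for real.

```python
import xml.etree.ElementTree as ET

# Heavily simplified, made-up stand-in for client_state.xml.
SAMPLE = """<client_state>
  <workunit>
    <name>wu_2952_35_745368_0_1338233405</name>
    <rsc_fpops_est>1.0e13</rsc_fpops_est>
    <rsc_fpops_bound>1.0e14</rsc_fpops_bound>
  </workunit>
  <workunit>
    <name>wu_2899_1_700000_0_1330000000</name>
    <rsc_fpops_est>1.0e13</rsc_fpops_est>
    <rsc_fpops_bound>1.0e14</rsc_fpops_bound>
  </workunit>
</client_state>"""


def raise_fpops_bound(xml_text: str, name_prefix: str, factor: float):
    """Multiply rsc_fpops_bound for workunits whose name starts with name_prefix.

    Returns the modified XML text and the number of workunits changed.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    root = ET.fromstring(xml_text)
    changed = 0
    for wu in root.iter("workunit"):
        name = wu.findtext("name", default="")
        bound = wu.find("rsc_fpops_bound")
        if name.startswith(name_prefix) and bound is not None:
            bound.text = f"{float(bound.text) * factor:e}"
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


if __name__ == "__main__":
    new_xml, count = raise_fpops_bound(SAMPLE, "wu_2952_", 10)
    print(count)
    print(new_xml)
```

Matching on a name prefix keeps the change limited to the problem batch, so normal workunits keep their original protection against runaway tasks.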


____________
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer

Thyme Lawn
Joined: Jun 20 06
Posts: 181
Credit: 1,233,724
RAC: 1,389

wu_2952_29_745368_0_1338233405_0 was returned yesterday with slightly over 4 days of run time, a bit over 2 days after its deadline. That task must have a quorum > 1 as its status is "Completed, waiting for validation". I doubt if it will be validated (it's very unlikely the WU will have a second completed task).

The status for that one has now changed to "Completed, can't validate", but strangely the workunit still has "Tasks in progress" set to "suppressed pending completion". C'est la vie.
____________
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer

Neil
Joined: Dec 30 09
Posts: 20
Credit: 1,815,035
RAC: 1,484

Task: 126247327
Work unit: 69482324
Computer: 200864 (Core2 CPU 6600 @ 2.40GHz)
Sent: 28 May 2012 14:54:25 UTC
Time reported: 29 May 2012 9:25:44 UTC
Status: Completed, can't validate
Run time (sec): 54,942.67
CPU time (sec): 54,873.67
Credit: 0.00
Application: openMalaria: A simulator of malaria epidemology and control (Branch A) v6.58

Well, one core wasted 15 hours. There have probably been bigger wastes since May 28, and I hope they'll be the last.

The last task sent to me that wound up on my Pending list ("waiting for validation") was sent June 1.

Starting June 2, I've been sent 140 additional tasks, and every one of them "Completed and validated" except for 28 that are still running. No Pendings, Invalids, or Errors. That's better.

The tasks all seem to be running from 1 to 3 hours. Most start off with my client saying they'll take 17 hours, and then they whittle down to a few hours by the time they end.

Best luck,
-n-

Thyme Lawn
Joined: Jun 20 06
Posts: 181
Credit: 1,233,724
RAC: 1,389

The other task (wu_2952_35_745368_0_1338233405_0) is still running in high priority mode (currently at 77% after 134 hours run time with BOINC Manager projecting 40 hours to completion).

That one was sent a project abort after running for another 8 hours. 2 tasks (still to be reported) which had been running on my laptop for 49 hours have just suffered the same fate. That's a total of 2 weeks of unproductive run time for 4 tasks.
____________
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer

jdvb
Joined: Apr 4 11
Posts: 19
Credit: 3,950,827
RAC: 6,298

That one was sent a project abort
That would be a good thing!
If MC has done that to all long running WU's that would mean things would be back to normal! And normal is good!

Neil
Joined: Dec 30 09
Posts: 20
Credit: 1,815,035
RAC: 1,484

Normal is good!

The wheels of business grind slowly.

------

My Recent Average Credit is climbing about 100 per day, from last week's low of 1800 back to 4000 where it belongs.

"Pendings" down to 13 Tasks, from a high of about 40. I think I would have to keep better records offline to keep track of which Tasks were converted to either Valid or Errors, or if they just faded away without explanation [regarding what happened to the thousands of lost hours].

Moot question. It's a transient situation, returning to normal.

jdvb
Joined: Apr 4 11
Posts: 19
Credit: 3,950,827
RAC: 6,298

Update:
I now have another long-running task; however, this one does have a longer deadline. 16 hours have passed and progress is at 35%, so it should finish in time.

TylerChris
Joined: Mar 29 07
Posts: 23
Credit: 513,393
RAC: 2

Yup, the 2952** series of WUs are back.
This one
took the best part of 17 hours to complete on a duo running at 2.66.
It validated OK, and the checkpoints are very frequent, so not much time is lost when suspending,
but I would rather not get 'em.

michaelT
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: Jul 20 10
Posts: 47
Credit: 16,359
RAC: 0

Thanks for the info. To avoid a repeat of the previous situation, I've increased the flops_estimated and the deadline. :)

The parameters should now be OK: less than 5% of the tasks have an elapsed time of 6 hours, but I'll keep an eye on it and do additional tests.

Thyme Lawn
Joined: Jun 20 06
Posts: 181
Credit: 1,233,724
RAC: 1,389

My laptop completed 2 wu_2952 tasks overnight, both with elapsed time >37 hours (wu_2952_233_760582_0_1339482170_0 and wu_2952_24_760582_0_1339482167_0). Two wu_2952 tasks from the same batch were completed a couple of days ago in around 4 hours. Prior to that Branch A tasks had (with the exception of a wu_2899 task which took 7 hours) been completing in under 2 hours.

That system has 24 queued MCDN tasks with a deadline of less than 5 days, including 12 wu_2952 tasks (all downloaded 9 hours before Michael posted about the modified parameters). I'm hoping most of them will have elapsed times at the lower end of the range, but I've taken the precaution of suspending the wu_2952 tasks to ensure the others are completed before deadline.
____________
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer

jdvb
Joined: Apr 4 11
Posts: 19
Credit: 3,950,827
RAC: 6,298

The long-running task I mentioned lasted 101 hours and sadly returned 0 credit.

But at least the deadline now allowed it to finish; the next step would be to get credited for such tasks.

michaelT
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: Jul 20 10
Posts: 47
Credit: 16,359
RAC: 0

Problem solved and credit granted. :)

jdvb
Joined: Apr 4 11
Posts: 19
Credit: 3,950,827
RAC: 6,298

Problem solved and credit granted. :)
Super, thank you!

LogPile
Joined: Nov 16 07
Posts: 33
Credit: 1,205,314
RAC: 1,401

wu_2953_34_761971_0_1339591282_0

To date: 64 hrs at 47%, giving a projected total run time of 136 hrs, which will be outside the 6-day deadline.
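
The projection above is a straight-line extrapolation from elapsed time and progress. A minimal sketch of that arithmetic follows; the function name is made up for illustration, and it assumes progress continues at the average rate so far (other posts in this thread suggest these tasks tend to slow down, so the real figure may be higher). It does not check the deadline, since that depends on wall-clock time since download, which is not given here.

```python
def projected_total_hours(elapsed_hours, fraction_done):
    """Linear extrapolation of total run time.

    Assumes the task keeps progressing at its average rate so far,
    which is only an approximation for tasks that slow down.
    """
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours / fraction_done


# Figures from the post: 64 hours elapsed at 47% progress.
total = projected_total_hours(64, 0.47)
remaining = total - 64
print(f"projected total: {total:.0f} h, remaining: {remaining:.0f} h")
```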

Continue or not that is the question?

Paul
Joined: Aug 17 10
Posts: 7
Credit: 481,674
RAC: 0


Continue or not that is the question?


And did you? I ask because two more 2953 jobs have landed with me. Mind you, these seem to be going significantly faster than the one that timed out last time.

Edit: Just read rest of thread...looks like it's worth leaving it run for now.

jdvb
Joined: Apr 4 11
Posts: 19
Credit: 3,950,827
RAC: 6,298

I now have another one, wu_2953_29_761674_0_1339566490_1, 44.3% done after 50 hours.
That's less than 1 percent per hour, and as all such tasks slow down a bit I estimate it will finish after about 140 hours. That is on an i7.

Given its average credit / 8 threads / 24 hours * 140 hours duration, that should be over 4K credits for one task!

The previous machine that had a long-running one was quite a bit slower and completed it in 101 hours; that was not within the credit limit, even though that task was worth only about 1.5K credits.

The deadline of that task is nearly 7 days, so it should be able to complete before then if it does not slow down too much.

That said, slower machines, or computers that don't run 24/7, will not be able to complete such tasks in time. I have computers that run only 8 hours a day from Monday to Friday.
These computers are much slower, so they would require something like 240 hours to complete such tasks, and they only run for 40 hours a week.
That means completing such tasks on such a computer would require a deadline of 6 weeks, or 42 days.
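
The deadline estimate above can be checked with a few lines. This is only a sketch of the arithmetic in the post; the function name and parameters are made up for illustration, and it assumes the machine spends all of its powered-on time on this one task.

```python
def deadline_days_needed(task_hours, hours_per_day, days_per_week):
    """Calendar days needed for a task on a part-time machine.

    Assumes every powered-on hour goes to this single task.
    """
    weekly_hours = hours_per_day * days_per_week
    weeks = task_hours / weekly_hours
    return weeks * 7


# Figures from the post: 240 hours of work, 8 hours a day, 5 days a week.
print(deadline_days_needed(240, 8, 5))
```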

I do hope such tasks will be split up into multiple smaller ones, else there won't be many computers that meet the minimum requirements to complete them.

jdvb
Joined: Apr 4 11
Posts: 19
Credit: 3,950,827
RAC: 6,298

Hmm, it appears I have more than one such task.
One of those is not going to complete, as it's on a PC that is not running 24/7 but only for a few hours a day.
It would require a deadline of about a month.

I guess we're going back to only running malaria on PCs that can deal with such WUs :-(

LogPile
Joined: Nov 16 07
Posts: 33
Credit: 1,205,314
RAC: 1,401


Continue or not that is the question?


And did you? I ask because two more 2953 jobs have landed with me. Mind you, these seem to be going significantly faster than the one that timed out last time.

Edit: Just read rest of thread...looks like it's worth leaving it run for now.



It crashed (Exit status -177 (0xffffff4f)) after c. 140 hrs at c. 60%.
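
The two numbers in that exit status are the same value: the hex form is the signed code -177 read as an unsigned 32-bit integer. A small sketch of the conversion follows; the helper names are made up for illustration. As far as I know, BOINC names -177 ERR_RSC_LIMIT_EXCEEDED, but treat that as an assumption and check the project's own error list.

```python
def as_unsigned32(code):
    """Reinterpret a signed exit code as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF


def as_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed exit code."""
    return value - 0x100000000 if value & 0x80000000 else value


print(hex(as_unsigned32(-177)))
print(as_signed32(0xFFFFFF4F))
```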

Paul
Joined: Aug 17 10
Posts: 7
Credit: 481,674
RAC: 0

Mine too. Still giving an error of maximum runtime exceeded.

Phil Lancaster
Joined: Sep 30 08
Posts: 1
Credit: 5,325,989
RAC: 5,958

I have just had a couple of long run time units finish:

wu_2953_30_760698_0_1339490489_2 after 53 CPU hrs and wu_2953_34_762190_0_1339609293_2 after 103 CPU hrs

both finished with Compute Error and exit status -177

jdvb
Joined: Apr 4 11
Posts: 19
Credit: 3,950,827
RAC: 6,298

Mine also exited with the maximum time exceeded error.
Something needs to be done again.

Warped
Joined: Aug 1 10
Posts: 22
Credit: 207,998
RAC: 501

Over the weekend I had one of the 2953 work units which completed quite quickly (about 30 hours).

I have another one which is going slower and slower and now has no hope of completing within the deadline and will likely go to about 150 hours. Given the likely error, I'm aborting it.

michaelT
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: Jul 20 10
Posts: 47
Credit: 16,359
RAC: 0

I have doubled the maximum CPU time (rsc_fpops_bound), so it should not reach the limit anymore ... and now that we have enough results to process, the 2952 and 2953 series have been disabled until we find the reason for such large differences between workunits of the same series.
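
For readers wondering how rsc_fpops_bound turns into a time limit: a rough sketch, assuming the limit is approximately the bound divided by the host's estimated floating-point speed. That relationship is my understanding rather than something stated in this thread, and the numbers below are invented purely for illustration.

```python
def max_runtime_hours(rsc_fpops_bound, host_flops):
    """Approximate run-time limit in hours.

    Assumption: the task is aborted once its run time exceeds
    rsc_fpops_bound / host_flops seconds.
    """
    return rsc_fpops_bound / host_flops / 3600


# Hypothetical values, not taken from the project.
bound = 1.0e15       # floating-point operations allowed
host_flops = 2.0e9   # estimated host speed in operations per second

print(f"limit before: {max_runtime_hours(bound, host_flops):.1f} h")
print(f"limit after doubling: {max_runtime_hours(2 * bound, host_flops):.1f} h")
```

Under this assumption, doubling the bound doubles the allowed run time on every host, while slower hosts still get proportionally more time than faster ones.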

Paul
Joined: Aug 17 10
Posts: 7
Credit: 481,674
RAC: 0

Well, I finally had one of the big ones finish (~170 hours) :-)

wu_2953_34_761527_0_1339554310_2

But got no credit for it. :-( OK, I know the credit's just a bit of fun but it does leave me wondering whether I've been running my computer for a week and the result is just binned...

Profile The Gas Giant
Joined: Mar 7 06
Posts: 1214
Credit: 3,625,404
RAC: 2,632

I recently had one series 2953 finish OK and received over 2000 cs for it, which was about right based on the cpu time. I had one error out on the weekend after 91 hrs of crunching.

Link

Bugger is all I can say!

Paul.


Copyright © 2013 africa@home