Errors Overnight

The Knighty Ni
Joined: Aug 1 11
Posts: 10
Credit: 198,432
RAC: 0

Hi

Just installed a brand new HD, fully formatted, with a full system install. Did this because my system became unstable, which caused all nine of the BOINC projects I participate in to keep failing.

However, overnight, despite the full wipe and reinstall, I notice there are still some errors (listed below). I am curious to find out whether the cause is:
1. something wrong with the BOINC client
2. WU's that would have failed in any case
3. some tinkering I need to do to the fresh install.

The reason I am very interested in these WU's is that they all appear to have run long enough to complete, when compared to the successful results overnight. Also, when WU's run and don't error within the first few seconds, it usually means there is some reason for the failure other than an incorrect setup on my machine. Rapid failures usually mean there is a faulty setup on the host, which is what was happening to the majority of WU's before 27th May.
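As a rough sketch of that rule of thumb, the run times shown on a result page can be sorted into quick and late failures. The 60-second cut-off, the function names and the "3.20" sample value are assumptions made up for illustration; nothing here is defined by BOINC.

```python
def parse_seconds(text: str) -> float:
    """Convert a run time as shown on the result page, e.g. '1,649.53', to seconds."""
    return float(text.replace(",", ""))


def triage(run_time_s: float, quick_fail_s: float = 60.0) -> str:
    """Apply the rule of thumb: fast failures point at the host, late ones elsewhere.

    quick_fail_s is a hypothetical cut-off chosen for illustration only.
    """
    if run_time_s < quick_fail_s:
        return "check host setup"
    return "look beyond the host"


if __name__ == "__main__":
    # The first two values come from the results listed in this post;
    # "3.20" is an invented example of a rapid failure.
    for shown in ["1,649.53", "2,075.11", "3.20"]:
        print(shown, "->", triage(parse_seconds(shown)))
```

The cut-off is deliberately a parameter, since what counts as a "rapid" failure will differ between projects.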

BOINC 7.05.25 installed
GPU (which should not affect this project) GTX 560Ti Driver version 285.66
Any other system information you would need to identify where the fault could be?

WU's Affected
https://malariacontrol.net/result.php?resultid=126092034
Exception: initialKappa is invalid
OpenMalaria: Domain error
08:18:20 (5372): called boinc_finish
Run time 1,649.53
CPU time 1,636.94


https://malariacontrol.net/result.php?resultid=126088423
Exception: initialKappa is invalid
OpenMalaria: Domain error
07:25:58 (3120): called boinc_finish
Run time 2,075.11
CPU time 2,058.09


https://malariacontrol.net/result.php?resultid=126088386
Exception: initialKappa is invalid
OpenMalaria: Result too large
06:59:21 (6072): called boinc_finish
Run time 1,590.45
CPU time 1,579.16


https://malariacontrol.net/result.php?resultid=126088221
Exception: initialKappa is invalid
OpenMalaria: Result too large
07:01:10 (200): called boinc_finish
Run time 2,512.83
CPU time 2,486.48


https://malariacontrol.net/result.php?resultid=126076491
Exception: initialKappa is invalid
OpenMalaria: Domain error
08:24:15 (1892): called boinc_finish
Run time 1,569.06
CPU time 1,559.59


https://malariacontrol.net/result.php?resultid=126068476
Exception: initialKappa is invalid
OpenMalaria: Domain error
02:10:22 (208): called boinc_finish
Run time 1,868.88
CPU time 1,807.47


https://malariacontrol.net/result.php?resultid=126068345
Exception: initialKappa is invalid
OpenMalaria: Result too large
01:59:42 (2284): called boinc_finish
Run time 1,758.39
CPU time 1,692.22


https://malariacontrol.net/result.php?resultid=126061994
Exception: initialKappa is invalid
OpenMalaria: Result too large
05:49:53 (5852): called boinc_finish
Run time 1,765.53
CPU time 1,751.42
____________
The Art of Flying is Throwing Yourself at the Ground and Missing. Douglas Adams ... Hitchhikers Guide to the Galaxy.

Ananas
Joined: Mar 7 06
Posts: 58
Credit: 752,054
RAC: 408

The problem isn't on your side.

The Knighty Ni
Joined: Aug 1 11
Posts: 10
Credit: 198,432
RAC: 0

Thanks Ananas

That puts my mind at rest :)

The Knighty Ni
Joined: Aug 1 11
Posts: 10
Credit: 198,432
RAC: 0

Another one has turned up the same as some of the last ones:

https://malariacontrol.net/result.php?resultid=126138102
Exception: initialKappa is invalid
OpenMalaria: Result too large
17:13:46 (4272): called boinc_finish
Run time 1,749.56
CPU time 1,696.48

The Knighty Ni
Joined: Aug 1 11
Posts: 10
Credit: 198,432
RAC: 0

Another handful of faulty results.

If I can identify which ones are failing, and if it is possible, I'll set preferences to stop those WU's coming onto my machine, as the wasted crunching time is starting to mount up: over 9 hours of run time in total at present.

https://malariacontrol.net/result.php?resultid=126162548
Exception: initialKappa is invalid
OpenMalaria: Result too large
23:01:30 (5588): called boinc_finish
Run time 1,977.03
CPU time 1,834.50


https://malariacontrol.net/result.php?resultid=126163225
Exception: initialKappa is invalid
OpenMalaria: Domain error
23:38:30 (5252): called boinc_finish
Run time 2,218.48
CPU time 2,063.81


https://malariacontrol.net/result.php?resultid=126163347
Exception: initialKappa is invalid
OpenMalaria: Domain error
00:31:05 (3528): called boinc_finish
Run time 2,396.42
CPU time 2,222.31


https://malariacontrol.net/result.php?resultid=126165102
Exception: initialKappa is invalid
OpenMalaria: Result too large
01:07:10 (2560): called boinc_finish
Run time 2,163.20
CPU time 2,038.84


https://malariacontrol.net/result.php?resultid=126169800
Exception: initialKappa is invalid
OpenMalaria: Result too large
01:12:16 (5112): called boinc_finish
Run time 1,610.58
CPU time 1,564.02


https://malariacontrol.net/result.php?resultid=126189087
Exception: initialKappa is invalid
OpenMalaria: Domain error
09:34:16 (3020): called boinc_finish
Run time 2,011.05
CPU time 1,800.59


https://malariacontrol.net/result.php?resultid=126214595
Exception: initialKappa is invalid
OpenMalaria: Result too large
10:46:22 (3940): called boinc_finish
Run time 3,041.52
CPU time 2,692.58


https://malariacontrol.net/result.php?resultid=126216754
Exception: initialKappa is invalid
OpenMalaria: Result too large
11:34:42 (4548): called boinc_finish
Run time 1,840.69
CPU time 1,755.44

Neil
Joined: Dec 30 09
Posts: 20
Credit: 1,815,035
RAC: 1,484



Hi Nighty,

My RAC (Recent Average Credit) used to be about 3500.

Looks like the errors started around May 22, and my RAC started turning south around May 25. Now I'm down to 2945.

I went to my Malariacontrol Account page (https://malariacontrol.net/home.php), clicked on Tasks, found a Work Unit that's listed as "Error while computing," and clicked on the "Work unit click for details" column for that erroneous Work Unit.

That opened the webpage for that particular work unit (https://malariacontrol.net/workunit.php?wuid=69251915). It shows that one of my computers worked on that Work Unit, and so did 4 other computers, none of which are mine. Everyone's status is "Error while computing."

I guess it's good to see that the problem is not in our computers, but something screwy with the work units (or their validation, or...).
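The same cross-check can be written down as a small sketch. It assumes the statuses are copied by hand from the work unit page (no scraping is attempted, since the page layout isn't known here), and the function name and the two-host minimum are made up for illustration.

```python
from typing import Sequence

ERROR_STATUS = "Error while computing"


def workunit_looks_faulty(statuses: Sequence[str], min_hosts: int = 2) -> bool:
    """Return True when at least min_hosts hosts reported and every one errored.

    A single failing host proves little, so min_hosts (a hypothetical
    default) guards against blaming the work unit too early.
    """
    return len(statuses) >= min_hosts and all(s == ERROR_STATUS for s in statuses)


if __name__ == "__main__":
    # Five hosts, all with the same status, as on the work unit page above.
    print(workunit_looks_faulty([ERROR_STATUS] * 5))
    # A mixed outcome; the second status string is a placeholder.
    print(workunit_looks_faulty([ERROR_STATUS, "Completed and validated"]))
```

If every host that tried the Work Unit failed, the Work Unit itself is the likelier culprit; a mixed outcome points back at the individual machine.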

Yeh, lots of wasted computing time. I hope Malariacontrol quickly recognizes what's going on and gets it straightened out. I'm never going to be able to take over the world if my Work Units keep getting thrown out!

I'll check back here to see if you come up with a strategy for identifying and aborting faulty Work Units.

I've been wondering for a few months why my RAC rollercoasters between 3000 and 4000. It seemed like too much variation. I'll bet these errors have been popping up for a while.

Best luck,
-neil-
Member of team Flying Sams
Scranton, PA, where it's gray and dreary even on the sunniest of days
We have to work four times harder, because we'z only got Celerons

The Knighty Ni
Joined: Aug 1 11
Posts: 10
Credit: 198,432
RAC: 0

Hi Neil

Thanks for the support. If I manage to work out how to identify the bad WU's from the others I'll post here.

However, from this post it seems we need to avoid anything that is not openMalariaA:
https://malariacontrol.net/forum_thread.php?id=1276

Going to test the theory out before my team decide on the project of the month otherwise there could be a bunch of big hitters wasting cycles.
-------------------------------------------------------------------------------

More of the Same and Some Sum's

Total run time lost = 35,473.81 seconds (9.85 hours)
Total CPU time lost = 33,881.88 seconds (9.41 hours)
Cost in electricity for wasted time = about £0.54.

May not seem like much in terms of cost. However, work it out over a year at that rate (about £0.54 every two days) and that's £98.55 per annum burned for nothing.

4 of my 6 CPU's on this project gives 2.5 hours per CPU over about 45 hours which equals about 20% of the time wasted in processing these WU's.
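For anyone who wants to re-check the sums, here is a minimal sketch. The two-day period behind the yearly figure is an assumption inferred from the roughly 45 hours mentioned, and the function names are made up for illustration.

```python
SECONDS_PER_HOUR = 3600


def to_hours(seconds: float) -> float:
    """Convert a duration in seconds to hours."""
    return seconds / SECONDS_PER_HOUR


def annual_cost(cost: float, period_days: float) -> float:
    """Scale a cost incurred over period_days up to a 365-day year."""
    return cost * 365 / period_days


def wasted_share(lost_hours: float, cpus: int, wall_hours: float) -> float:
    """Fraction of the available CPU-hours (cpus * wall_hours) that was lost."""
    return lost_hours / (cpus * wall_hours)


if __name__ == "__main__":
    run_hours = to_hours(35_473.81)   # total run time lost, from the post
    cpu_hours = to_hours(33_881.88)   # total CPU time lost, from the post
    print(f"run time lost: {run_hours:.2f} h, CPU time lost: {cpu_hours:.2f} h")
    # Assumption: the 0.54 was spent over about two days.
    print(f"yearly cost: {annual_cost(0.54, 2):.2f}")
    print(f"share across 4 CPUs over 45 h: {wasted_share(run_hours, 4, 45):.1%}")
    print(f"share against 45 h of one CPU: {wasted_share(run_hours, 1, 45):.1%}")
```

Dividing the lost hours by the 45 wall-clock hours alone gives roughly the 20% quoted; spread across all 4 CPU's over the same period, the share comes out nearer 5.5%.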

How many other volunteers are experiencing the same?
What is the total cost across all volunteers?

Anyway that's my moan out of the way.

More of the same errors as earlier posts

https://malariacontrol.net/result.php?resultid=126219772
Exception: initialKappa is invalid
OpenMalaria: Result too large
14:49:38 (7096): called boinc_finish
Run time 4,030.39
CPU time 3,724.27


https://malariacontrol.net/result.php?resultid=126253334
Exception: initialKappa is invalid
OpenMalaria: Result too large
18:12:46 (6844): called boinc_finish
Run time 1,675.50
CPU time 1,641.94

Edited to include URL tags :)

michaelT
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: Jul 20 10
Posts: 47
Credit: 16,359
RAC: 0

Hi guys,

Thanks for the info...

We're trying to find the source of the problem. It could be that one of the parameters which is automatically generated using a genetic algorithm is out of the boundaries, so the human infectivity is too low and workunits are crashing.
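As an illustration only of how an out-of-bounds parameter can end in those two messages: "Domain error" and "Result too large" read like the C runtime's descriptions of the EDOM and ERANGE math errors. That mapping is an assumption, not something confirmed for OpenMalaria, and the helper name below is hypothetical. Python's math module raises the equivalent exceptions:

```python
import errno
import math


def errno_name_for(fn, x: float) -> str:
    """Evaluate fn(x) and report which C-style math error class, if any, it hits."""
    try:
        fn(x)
    except ValueError:
        # Argument outside the function's domain, e.g. log of a negative number.
        return errno.errorcode[errno.EDOM]
    except OverflowError:
        # Result too large to represent as a float, e.g. exp of a huge number.
        return errno.errorcode[errno.ERANGE]
    return "ok"


if __name__ == "__main__":
    print(errno_name_for(math.log, -1.0))    # input outside the domain of log
    print(errno_name_for(math.exp, 1000.0))  # result overflows a float
    print(errno_name_for(math.log, 0.5))     # a valid input
```

Either failure can be caused by a single generated value drifting outside the range a model formula expects, which fits a fault that only appears once the simulation is well under way.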

The automatic generation of new workunits for those particular cases has been disabled for Branch B and Test until the problem is solved... The rest is still ok.

We will let you know as soon as the problem is solved. :)

Neil
Joined: Dec 30 09
Posts: 20
Credit: 1,815,035
RAC: 1,484

Hi Michael T,

Glad you're on it.

I just looked at all of my Tasks that were still available for looking at (https://malariacontrol.net/results.php?userid=57156)

Over the past three days, tasks running between 2,000 and 10,000 seconds completed. But in half the tasks, the credit granted was about 15(!). The other half of the tasks crunched as long, but ended in errors and no credit.

There hasn't been a single properly completed task for as far back as the records go.

Anyway, I hope that's old news, soon to be back to normal.

Thanks for your work. Waiting for your autobiography,

-neil-

m
Joined: May 29 08
Posts: 4
Credit: 126,517
RAC: 113

Thanks for the feedback. Had me worried for a bit!

John.

Neil
Joined: Dec 30 09
Posts: 20
Credit: 1,815,035
RAC: 1,484


Hi M.

Don't worry; be happy. Like Michael T said:

> It could be that one of the parameters which is automatically generated using a genetic algorithm is out of the boundaries, so the human infectivity is too low and workunits are crashing. The automatic generation of new workunits for those particular cases has been disabled for Branch B and Test until the problem is solved.

I'm somewhat worried about one thing, though. Michael also wrote:

> ... The rest is still ok.

I don't know what he's referring to. I can't find a single Task that has ended "ok" in the past few days.

I've stopped my few computers from downloading any more work, and aborted all the tasks they were working on and were waiting to work on.

I wish I could post an attachment to show the graphic of my Recent Average Credit tanking. Of course, the real victim is Malariacontrol progress.

Best luck,
-neil-

The Knighty Ni
Joined: Aug 1 11
Posts: 10
Credit: 198,432
RAC: 0

Thanks MichaelT

Look forward to when the issue is resolved.

In the meantime I'll keep munching up the WU's that are available.

Kind regards
The Knighty Ni.

P.S. If there is any information you need regarding my rig let me know and I'll PM it to you.

michaelT
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: Jul 20 10
Posts: 47
Credit: 16,359
RAC: 0

Hi Neil,

> I don't know what he's referring to. I can't find a single Task that has ended "ok" in the past few days.

To explain, we have several "runs" and each run generates workunits. Last Thursday, we created new runs and set their priority higher than that of the old runs, which were already there and working fine. Those runs are intended to find the best parameters for the mathematical models which are used in the application.

In the end, some of the new runs seem to have problems. We suppose that we may have hit a bad combination in the parameter space which couldn't be predicted. But we're still investigating.

We disabled the problematic runs yesterday, so now the old runs should generate correct workunits, but it could be that a few of the bad ones are still in the pipeline. Let me know if they keep crashing.

The Knighty Ni
Joined: Aug 1 11
Posts: 10
Credit: 198,432
RAC: 0

I'll be interested to see what the return on this one will be as it has already run for over 46 hours. By far the longest running WU from Malaria Control so far.

https://malariacontrol.net/result.php?resultid=126204401

It's still got another 1.5 hrs to run which will bring it up to a total of about 48 hrs runtime.

On a slower machine, say at 2.66 GHz, the same WU would take about 55.5 hours to complete.

Reality is I prefer longer running WU's. The longest one ever was benchmarked at 1675 hours, from another project, on one of my former machines running at 2.2 GHz about 3 years ago. Nice and steady, trickle up, credit building WU :)

Oh yes. Thanks Michael. Haven't had any WU error since the 29th very early in the morning. Total erroring WU's was about 31-32 that ran for either full term or almost to full term. The last 2-3 I didn't post here as you are onto the problem.

Really appreciate what you are doing, and it makes me want to hang around longer to help. :)

The Knighty Ni
Joined: Aug 1 11
Posts: 10
Credit: 198,432
RAC: 0

Well, it's been about 10 days now and I'm certainly very pleased that no more errors have been produced.

Now running around 200+ WU's daily, and the only error has been one the server cancelled. Lol.

So very happy about this, and it proves the rig is stable. Now maybe it's time to start O/Cing it again, back to where it was prior to the rig bugging out on all projects.

Tested all of the projects over the last few days with minimal errors all round.

Thanks for taking the time to look into this MichaelT and other project staff who have helped. :)


Copyright © 2013 africa@home