New version 6.12 of the malariacontrol science application ready for testing

Message boards : Number crunching : New version 6.12 of the malariacontrol science application ready for testing

Profile maire
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: Nov 7 05
Posts: 438
Credit: 118,258
RAC: 0

Just a short post to announce that the new version of the malariacontrol science application is now ready for testing.
This is a C++ implementation of the previous Fortran version, and we will post a few more details soon.

We will start running test workunits shortly. This means that you need to allow work for test applications in your project preferences as described here.

Please report any problems in this thread.

Thanks for your patience!
Nick
____________
Nicolas Maire
Swiss Tropical and Public Health Institute
http://www.swisstph.ch

Ageless
Joined: Jun 29 06
Posts: 261
Credit: 149,220
RAC: 17

Less than a 2-day deadline? Ouch, best not to ask for too much work then. They come in at an estimated hour and a half, but I suspect they'll take a bit longer than that.

Good to see this project back. :-)

They do not have graphics, is that correct? Or do they not have graphics on BOINC 6? I don't see a separate graphics application running in Task Manager.
____________
Jord.

BOINC FAQ Service

Chris Sutton
Joined: Nov 10 05
Posts: 297
Credit: 4,941,683
RAC: 0

Welcome back Nick (and MCDN :)

I have an old Linux box that requires a statically compiled binary, but the current test binary requires the libstdc++.so.6 shared library. Unfortunately I've burned through a bunch of test WUs in the interim...

P . P . L .
Joined: Aug 27 08
Posts: 56
Credit: 500,976
RAC: 0

Hi.

Just on the deadlines, mine are all one(1) day and have gone to high priority.

Not real good.

btw, good to see you all back.

pete.
____________

Profile maire
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: Nov 7 05
Posts: 438
Credit: 118,258
RAC: 0

Thanks for the feedback! I've extended the deadline (for workunits created from now on), and static linking is on the list for the next version.
You're right, the graphics app (now separate) is not yet included.
Nick
____________
Nicolas Maire
Swiss Tropical and Public Health Institute
http://www.swisstph.ch

RandyC
Joined: Jun 23 06
Posts: 2942
Credit: 926,890
RAC: 1,261

Initial estimate for me was 1.5 hrs, but after 2hrs 40min. I've still got 1hr 35 min. left to go (ha ha!). [edit]Says it's 51% complete.[/edit]

This is on a Win XP PRO Athlon 2600+ system.
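
As a rough cross-check of the figures above, a linear extrapolation from elapsed time and percent complete can be compared with the client's remaining-time estimate. This is only a sketch: it assumes progress is linear in time, which may not hold for these workunits, and the function name is made up for illustration.

```python
def project_runtime(elapsed_min, fraction_done):
    """Return (projected total, projected remaining) in minutes,
    assuming progress is linear in elapsed time."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    total = elapsed_min / fraction_done
    return total, total - elapsed_min


# Figures from the post: 2 h 40 min elapsed at 51% complete.
total, remaining = project_runtime(elapsed_min=160, fraction_done=0.51)
print(f"projected total: {total / 60:.1f} h, remaining: {remaining / 60:.1f} h")
```

Under that assumption the task would need a little over 5 hours in total, well above the 1.5 hour initial estimate.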

Ageless
Joined: Jun 29 06
Posts: 261
Credit: 149,220
RAC: 17

Initial estimate for me was 1.5 hrs

Same estimate on my Win XP Pro AMD XP2200+, but they take 3h 15 minutes. Still, since I had 10 of them waiting, I aborted some of them, so as not to drive the debt up too high for Malaria. Plus it's less than 24 hours to go on the deadline, so they would never all be done before their deadline. ;-)
____________
Jord.

BOINC FAQ Service

RandyC
Joined: Jun 23 06
Posts: 2942
Credit: 926,890
RAC: 1,261

Initial estimate for me was 1.5 hrs

Same estimate on my Win XP Pro AMD XP2200+, but they take 3h 15 minutes. Still, since I had 10 of them waiting, I aborted some of them, so as not to drive the debt up too high for Malaria. Plus it's less than 24 hours to go on the deadline, so they would never all be done before their deadline. ;-)


My first WU is now 93% complete and it looks like it will be 5hrs per WU. That means with the 30hr deadline allotted, I can only return 6 (MAYBE).

Not sure how your system is so much faster than my 2600 (however, it isn't a dedicated cruncher either). Perhaps crunch times per WU are somewhat variable.
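
The deadline arithmetic in this post can be written out as a small sketch. It assumes the machine crunches nothing else, runs around the clock, and that every task takes the same time; the function name and the `cores` parameter are hypothetical.

```python
import math


def max_tasks_before_deadline(deadline_hours, hours_per_task, cores=1):
    """Tasks that can finish before the deadline if each core
    runs tasks back to back with no idle time."""
    per_core = math.floor(deadline_hours / hours_per_task)
    return per_core * cores


print(max_tasks_before_deadline(30, 5))     # 30 h deadline, 5 h per WU
print(max_tasks_before_deadline(24, 3.25))  # 24 h left, 3 h 15 min per WU
```

The first call gives the 6 workunits mentioned above; the second uses the 3 h 15 min figure quoted from the other host.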

P . P . L .
Joined: Aug 27 08
Posts: 56
Credit: 500,976
RAC: 0

Hi.

My first one only took 50min on my Quad, we'll see how the rest go!

pete.

____________

Chipotle
Joined: Dec 24 07
Posts: 2
Credit: 260,798
RAC: 245

My first two are taking 6-7 hours each on a Core 2 Duo (an order of magnitude longer than Peter Leman's quad). I sure hope there's some variance in the WU runtimes to account for this.

P . P . L .
Joined: Aug 27 08
Posts: 56
Credit: 500,976
RAC: 0

Hi Chipotle.

I guess there are tasks of different lengths: my P4 3.0 took over an hour for one, but my Quad finished another in a bit over 34 min. Neither machine is OC'd at all.

Both running Ubuntu btw. :)

pete.

____________

Chipotle
Joined: Dec 24 07
Posts: 2
Credit: 260,798
RAC: 245

Hi Pete.

My run times are all consistently longer, several hours. Most of them are P4s of similar speed to yours. My machines are running Windoze :(

Dotsch
Joined: Jun 21 06
Posts: 65
Credit: 35,926
RAC: 5

On MacOS 10.5.x on Intel with BOINC client 6.2.18, the CPU throttling did not work at some stages. At those stages the CPU jumps to 100% and at the same time the disk I/O increases to about 2 to 3 MB/s. This lasts about 15 seconds and repeats every few minutes.

Profile Krunchin-Keith [USA]
Volunteer moderator
Volunteer tester
Joined: Nov 10 05
Posts: 3221
Credit: 5,501,925
RAC: 3,659

I have various lengths on same host, Windows XP.

times in hours:minutes
1:27
1:24
3:14
2:13
1:40
1:05
2:06
etc

So far 22 issued over last day or so. 19 of those have completed and reported successfully, credit is being issued, some are still pending.
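
To put a number on the spread, the run times listed above can be summarized with a few lines of Python. This is just a sketch over the seven values shown in the post; the "etc" entries are not included, so the figures describe this sample only.

```python
from statistics import mean


def to_minutes(hhmm):
    """Convert an 'H:MM' string to whole minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


run_times = ["1:27", "1:24", "3:14", "2:13", "1:40", "1:05", "2:06"]
minutes = [to_minutes(t) for t in run_times]

print(f"shortest: {min(minutes)} min")
print(f"longest:  {max(minutes)} min")
print(f"mean:     {mean(minutes):.0f} min")
```

On this sample the longest task takes roughly three times as long as the shortest, on the same host.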

Chris Sutton
Joined: Nov 10 05
Posts: 297
Credit: 4,941,683
RAC: 0

On MacOS 10.5.x on Intel with BOINC client 6.2.18, the CPU throttling did not work at some stages. At those stages the CPU jumps to 100% and at the same time the disk I/O increases to about 2 to 3 MB/s. This lasts about 15 seconds and repeats every few minutes.

From the symptoms, this sounds like the checkpoints. Does it coincide with your "Write to disk at most every" setting?

Dotsch
Joined: Jun 21 06
Posts: 65
Credit: 35,926
RAC: 5

On MacOS 10.5.x on Intel with BOINC client 6.2.18, the CPU throttling did not work at some stages. At those stages the CPU jumps to 100% and at the same time the disk I/O increases to about 2 to 3 MB/s. This lasts about 15 seconds and repeats every few minutes.

From the symptoms, this sounds like the checkpoints. Does it coincide with your "Write to disk at most every" setting?

Yes, I also thought that the checkpoints were the cause of the I/O. The cycle matches the BOINC preferences.
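
For readers unfamiliar with how a "write to disk at most every N seconds" preference leads to periodic bursts like this, here is a generic sketch of interval-limited checkpointing. It is not the BOINC client's or the malariacontrol application's actual code; the class, its method names, and the injectable clock are assumptions made for illustration.

```python
import time


class CheckpointThrottle:
    """Permit a checkpoint only if at least `min_interval_s` seconds
    have passed since the previous one."""

    def __init__(self, min_interval_s, clock=time.monotonic):
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._last = clock()

    def time_to_checkpoint(self):
        # True once the configured interval has elapsed.
        return self._clock() - self._last >= self.min_interval_s

    def checkpoint_completed(self):
        # Restart the interval after state has been written to disk.
        self._last = self._clock()


# Demonstration with a fake clock so the example runs instantly.
fake_now = [0.0]
throttle = CheckpointThrottle(60, clock=lambda: fake_now[0])

fake_now[0] = 30.0
too_early = throttle.time_to_checkpoint()

fake_now[0] = 61.0
due = throttle.time_to_checkpoint()
throttle.checkpoint_completed()
just_after = throttle.time_to_checkpoint()

print(too_early, due, just_after)
```

Each time the check succeeds the application writes its full state, which would show up as a short burst of disk I/O at roughly the configured interval.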

Profile maire
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: Nov 7 05
Posts: 438
Credit: 118,258
RAC: 0

As you may have seen, we have some issues with the assimilator, which is crashing on some of the validated results. It's a problem to do with the output file formatting, and we'll need to fix it before sending out more work. This may take a few days.

Nick
____________
Nicolas Maire
Swiss Tropical and Public Health Institute
http://www.swisstph.ch

Profile Bill
Joined: Jun 21 06
Posts: 11
Credit: 482,761
RAC: 114

I had to abort two units that ran up to the time limit; they were stuck in restart loops, restarting but not progressing.

2/28/2009 7:05:34 AM|malariacontrol.net|Restarting task wu_609_180091_1_1_1235529486_0 using malariacontrolBeta version 612
2/28/2009 7:09:55 AM|malariacontrol.net|Restarting task wu_609_60004_1_0_1235529483_2 using malariacontrolBeta version 612


____________

Chris Sutton
Joined: Nov 10 05
Posts: 297
Credit: 4,941,683
RAC: 0

A user reported error: malariacontrol test app 6.12 failing w/ error code -1 (0xffffffff)

Profile Bymark
Joined: Jul 5 06
Posts: 155
Credit: 9,347,187
RAC: 1

I hope this new version 6.12 has GPU support?
GPUs crunch BOINC work so much faster than CPUs.....

Regards

Silakka

____________

AndrewB57
Joined: May 9 08
Posts: 3
Credit: 61,483
RAC: 0

Back up and running, looking good so far
____________

Profile maire
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: Nov 7 05
Posts: 438
Credit: 118,258
RAC: 0

We've finally fixed the assimilator, and are currently sending out more work. Thanks for submitting problem reports! I should have mentioned that we have quite good information on workunits that simply crashed (this is sent back to the server by your BOINC client). What is really useful is information on why you aborted a certain workunit (like Bill's "stuck in restart loops" above).

By far the most frequent problem right now is library incompatibilities on linux hosts. We'll make deploying a recompiled linux app our first priority.
Thanks
Nick
____________
Nicolas Maire
Swiss Tropical and Public Health Institute
http://www.swisstph.ch

Jean-David Beyer
Joined: Jan 5 07
Posts: 18
Credit: 168,732
RAC: 80

We've finally fixed the assimilator, and are currently sending out more work. Thanks for submitting problem reports! I should have mentioned that we have quite good information on workunits that simply crashed (this is sent back to the server by your BOINC client). What is really useful is information on why you aborted a certain workunit (like Bill's "stuck in restart loops" above).

By far the most frequent problem right now is library incompatibilities on linux hosts. We'll make deploying a recompiled linux app our first priority.
Thanks
Nick


I have yet to receive any work units, even though I am signed up for everything (I think). My machine has two Xeon processors and 8 GBytes RAM, so that should not be a problem. I get the message that no work is available even though the server status says that there is work available. I run Red Hat Enterprise Linux 5, if that matters.

RandyC
Joined: Jun 23 06
Posts: 2942
Credit: 926,890
RAC: 1,261


I have yet to receive any work units, even though I am signed up for everything (I think). My machine has two Xeon processors and 8 GBytes RAM, so that should not be a problem. I get the message that no work is available even though the server status says that there is work available. I run Red Hat Enterprise Linux 5, if that matters.


The only work currently being sent out is test units, so if you haven't enabled the option to allow test WUs in your preferences you won't get anything.

P . P . L .
Joined: Aug 27 08
Posts: 56
Credit: 500,976
RAC: 0

Hi.

Good to see the project up and running again.

pete.

____________

Jean-David Beyer
Joined: Jan 5 07
Posts: 18
Credit: 168,732
RAC: 80


I have yet to receive any work units, even though I am signed up for everything (I think). My machine has two Xeon processors and 8 GBytes RAM, so that should not be a problem. I get the message that no work is available even though the server status says that there is work available. I run Red Hat Enterprise Linux 5, if that matters.


The only work currently being sent out is test units, so if you haven't enabled the option to allow test WUs in your preferences you won't get anything.


I allow lots of test units. My preferences say, in part:

Run malariacontrol simulation application: Yes
Run malariacontrol test application: Yes
Run map predictor application: Yes
Run optimizer application: Yes


My message log says things like this:

Wed 11 Mar 2009 03:03:12 AM EDT|malariacontrol.net|Message from server: No work sent
Wed 11 Mar 2009 03:03:12 AM EDT|malariacontrol.net|Message from server: No work is available for malariacontrol.net
Wed 11 Mar 2009 03:03:12 AM EDT|malariacontrol.net|Message from server: No work is available for malariacontrol.net test version
Wed 11 Mar 2009 03:03:12 AM EDT|malariacontrol.net|Message from server: No work is available for Prediction of Malaria Prevalence
Wed 11 Mar 2009 03:03:12 AM EDT|malariacontrol.net|Message from server: No work is available for Estimation of parameters of infection dynamics (variable duration, max 4h)

David Ball
Joined: Apr 14 07
Posts: 4
Credit: 589,504
RAC: 302

I just had a work unit fail after 18 hours and 54 minutes with a compute error -177 (0xffffff4f).

Vista 32 bit home premium SP1 (current on updates) on an HP machine with Q6600 quad core cpu and 4 GB ram. Machine is a dedicated cruncher and is completely stock. NO Overclocking. Boinc client 6.2.19.

https://malariacontrol.net/result.php?resultid=44592526

Could this have been due to the system being restarted on patch tuesday to apply the Vista patches for this month? Maybe a problem in the checkpoint code?


Server state: Over
Outcome: Client error
Client state: Compute error
Exit status: -177 (0xffffff4f)
Computer ID: 116241
Report deadline: 14 Mar 2009 6:13:17 UTC
CPU time: 68069.21
stderr out:

<core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
Maximum CPU time exceeded
</message>
<stderr_txt>
load_rng seed1


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x77C37DFE

Engaging BOINC Windows Runtime Debugger...
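
The exit status in this report, like the -1 (0xffffffff) mentioned earlier in the thread, is one 32-bit value shown twice: as a signed decimal and as unsigned hexadecimal. A small sketch of the two's-complement conversion follows; the helper names are made up for illustration.

```python
def to_unsigned32_hex(code):
    """Format a signed 32-bit exit code as unsigned hexadecimal."""
    return f"0x{code & 0xFFFFFFFF:08x}"


def to_signed32(hex_text):
    """Interpret a 32-bit hexadecimal string as a signed integer."""
    value = int(hex_text, 16)
    return value - (1 << 32) if value >= (1 << 31) else value


print(to_unsigned32_hex(-177))    # 0xffffff4f
print(to_signed32("0xffffff4f"))  # -177
print(to_unsigned32_hex(-1))      # 0xffffffff
```

The conversion only shows that both forms are the same number; what the code means is stated by the "Maximum CPU time exceeded" message in the stderr output above.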

Henk Metselaar
Joined: Sep 24 08
Posts: 1
Credit: 177,747
RAC: 17

I have a workunit that seems stuck at 94.507% completion. I noticed this yesterday and there has been no progress since. It's been there for about 10h of computing time. Suspend/resume didn't help, so I cancelled it.

linux-2.6.26-i686, boinc 6.2.14,
thanks,
Henk

Profile maire
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: Nov 7 05
Posts: 438
Credit: 118,258
RAC: 0

Thank you all for helping with this first round of testing the new application! 6.13, which is a rebuild of the application for linux with a statically linked libstdc++, seems to work as expected, and the linux success rate has increased from 27% to 99% as a consequence. This enables us to have a closer look at the remaining problems on all platforms. We have plenty of error data to analyze and will stop sending out new work for now. We'll start again once we've identified and hopefully fixed the most severe remaining problems.
Have a nice weekend
Nick
____________
Nicolas Maire
Swiss Tropical and Public Health Institute
http://www.swisstph.ch

Profile rbo
Joined: Apr 6 07
Posts: 4
Credit: 23,157,318
RAC: 0

Hello,

Could you please also include "Client Detached" as a value in tasks' outcome messages?

Kind Regards

514 13 Mar 2009 12:49:42 UTC 13 Mar 2009 12:58:41 UTC Over Client detached New 0.00 --- ---
513 13 Mar 2009 12:49:43 UTC 13 Mar 2009 12:58:41 UTC Over Client detached New 0.00 --- ---
512 13 Mar 2009 12:49:42 UTC 13 Mar 2009 12:58:41 UTC Over Client detached New 0.00 --- ---
511 13 Mar 2009 12:49:42 UTC 13 Mar 2009 12:58:41 UTC Over Client detached New 0.00 --- ---
510 13 Mar 2009 12:49:42 UTC 13 Mar 2009 12:58:41 UTC Over Client detached New 0.00 --- ---
508 13 Mar 2009 12:49:43 UTC 13 Mar 2009 12:58:41 UTC Over Client detached New 0.00 --- ---
507 13 Mar 2009 12:49:43 UTC 13 Mar 2009 12:58:41 UTC Over Client detached New 0.00 --- ---
506 13 Mar 2009 12:49:42 UTC 13 Mar 2009 12:58:41 UTC Over Client detached New 0.00 --- ---
505 13 Mar 2009 12:49:42 UTC 13 Mar 2009 12:58:41 UTC Over Client detached New 0.00



Thank you all for helping with this first round of testing the new application! 6.13, which is a rebuild of the application for linux with a statically linked libstdc++, seems to work as expected, and the linux success rate has increased from 27% to 99% as a consequence. This enables us to have a closer look at the remaining problems on all platforms. We have plenty of error data to analyze and will stop sending out new work for now. We'll start again once we've identified and hopefully fixed the most severe remaining problems.
Have a nice weekend
Nick


____________
rbo

Profile idahofisherman
Joined: Jan 4 07
Posts: 5
Credit: 19,314
RAC: 0

It looks like the WUs are restarting continuously. I will have to abort them in order to free up the computer for other WUs.

I'm receiving the following messages, and the work units will not finish in time.

Host Project Date Message
ELEMENTS malariacontrol.net 3/14/2009 5:39:13 AM Restarting task wu_501_514_2524_0_1236733575_1 using malariacontrolBeta version 612
ELEMENTS malariacontrol.net 3/14/2009 5:39:13 AM Restarting task wu_506_232_2542_0_1236751092_0 using malariacontrolBeta version 612
ELEMENTS malariacontrol.net 3/14/2009 5:43:17 AM Restarting task wu_501_514_2524_0_1236733575_1 using malariacontrolBeta version 612
ELEMENTS malariacontrol.net 3/14/2009 5:43:17 AM Restarting task wu_506_232_2542_0_1236751092_0 using malariacontrolBeta version 612
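
A restart loop like this can be spotted by counting "Restarting task" lines per task name in the message log. The sketch below uses the four lines from this post as sample input; the regular expression, the function name, and the threshold of more than one restart are assumptions, and a real log would need a longer window to tell a loop from an ordinary resume.

```python
import re
from collections import Counter

# Matches the task name in lines such as
# "... Restarting task <name> using malariacontrolBeta version 612".
RESTART = re.compile(r"Restarting task (\S+) using")


def count_restarts(log_lines):
    """Count how often each task appears in a 'Restarting task' message."""
    counts = Counter()
    for line in log_lines:
        match = RESTART.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


log = [
    "ELEMENTS malariacontrol.net 3/14/2009 5:39:13 AM Restarting task wu_501_514_2524_0_1236733575_1 using malariacontrolBeta version 612",
    "ELEMENTS malariacontrol.net 3/14/2009 5:39:13 AM Restarting task wu_506_232_2542_0_1236751092_0 using malariacontrolBeta version 612",
    "ELEMENTS malariacontrol.net 3/14/2009 5:43:17 AM Restarting task wu_501_514_2524_0_1236733575_1 using malariacontrolBeta version 612",
    "ELEMENTS malariacontrol.net 3/14/2009 5:43:17 AM Restarting task wu_506_232_2542_0_1236751092_0 using malariacontrolBeta version 612",
]

restarts = count_restarts(log)
for task, n in restarts.items():
    if n > 1:
        print(f"{task}: restarted {n} times")
```

Both tasks in the sample restart twice, about four minutes apart.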

____________

Profile idahofisherman
Joined: Jan 4 07
Posts: 5
Credit: 19,314
RAC: 0

It looks like the WUs are restarting continuously. I will have to abort them in order to free up the computer for other WUs.

I'm receiving the following messages, and the work units will not finish in time.

Host Project Date Message
ELEMENTS malariacontrol.net 3/14/2009 5:39:13 AM Restarting task wu_501_514_2524_0_1236733575_1 using malariacontrolBeta version 612
ELEMENTS malariacontrol.net 3/14/2009 5:39:13 AM Restarting task wu_506_232_2542_0_1236751092_0 using malariacontrolBeta version 612
ELEMENTS malariacontrol.net 3/14/2009 5:43:17 AM Restarting task wu_501_514_2524_0_1236733575_1 using malariacontrolBeta version 612
ELEMENTS malariacontrol.net 3/14/2009 5:43:17 AM Restarting task wu_506_232_2542_0_1236751092_0 using malariacontrolBeta version 612


After I aborted them, I received the following messages:

3/14/2009 5:59:58 AM malariacontrol.net Sending scheduler request: Requested by user.
3/14/2009 5:59:58 AM malariacontrol.net Reporting 2 completed tasks, not requesting new tasks
3/14/2009 6:00:03 AM malariacontrol.net Scheduler request completed: got 0 new tasks
3/14/2009 6:00:13 AM malariacontrol.net [error] garbage_collect(); still have active task for acked result wu_501_514_2524_0_1236733575_1; state 0
3/14/2009 6:00:13 AM malariacontrol.net [error] garbage_collect(); still have active task for acked result wu_506_232_2542_0_1236751092_0; state 0
3/14/2009 6:00:13 AM malariacontrol.net Computation for task wu_501_514_2524_0_1236733575_1 finished
3/14/2009 6:00:13 AM malariacontrol.net Output file wu_501_514_2524_0_1236733575_1_0 for task wu_501_514_2524_0_1236733575_1 absent
3/14/2009 6:00:13 AM malariacontrol.net Computation for task wu_506_232_2542_0_1236751092_0 finished
3/14/2009 6:00:13 AM malariacontrol.net Output file wu_506_232_2542_0_1236751092_0_0 for task wu_506_232_2542_0_1236751092_0 absent
3/14/2009 6:00:19 AM malariacontrol.net Sending scheduler request: To report completed tasks.
3/14/2009 6:00:19 AM malariacontrol.net Reporting 2 completed tasks, not requesting new tasks
3/14/2009 6:00:24 AM malariacontrol.net Scheduler request completed: got 0 new tasks
3/14/2009 6:00:24 AM malariacontrol.net Message from server: Completed result wu_501_514_2524_0_1236733575_1 refused: result already reported as error
3/14/2009 6:00:24 AM malariacontrol.net Message from server: Completed result wu_506_232_2542_0_1236751092_0 refused: result already reported as error
3/14/2009 6:00:24 AM malariacontrol.net [error] Got ack for task wu_501_514_2524_0_1236733575_1, but can't find it
3/14/2009 6:00:24 AM malariacontrol.net [error] Got ack for task wu_506_232_2542_0_1236751092_0, but can't find it


____________

Tom Philippart
Joined: Jun 25 06
Posts: 29
Credit: 220,888
RAC: 0

So far the application runs great for me! I'm happy to crunch some more malaria in the future.

Now that we have a C++ application, do you plan to work on optimizations, for instance an official SSE or SSE2 power-user application, like Einstein, Enigma, Milkyway and other projects with C/C++ science applications do? This would be great. Sorry for asking this; I'm sure you had to work a lot to translate the application and I'm still asking for more, shame on me...
____________


Copyright © 2013 africa@home