I started running malariacontrol on a dual Pentium III Linux box, and work units fail.


Jean-David Beyer
Joined: Jan 5 07
Posts: 18
Credit: 168,677
RAC: 80

86236851 is a typical failure.

If I understand correctly, it complains that the application program is not there:

../../projects/malariacontrol.net/openMalariaA_6.58_i686-pc-linux-gnu [0x84c16f1]
OpenMalaria: No such file or directory

But if I check, it is there:

valinuxl:boinc[~/BOINC/projects/malariacontrol.net]$ ls -l
total 10856
-rw-r--r-- 1 boinc boinc 20185 Dec 17 15:01 autoRegressionParameters.csv
-rw-r--r-- 1 boinc boinc 38132 Dec 17 15:01 densities.csv
-rwxr-xr-x 1 boinc boinc 10854568 Dec 17 15:01 openMalariaA_6.58_i686-pc-linux-gnu
-rw-r--r-- 1 boinc boinc 141319 Dec 17 15:01 scenario_29.xsd
-rw-r--r-- 1 boinc boinc 30207 Dec 17 17:20 wu_2899_34_939960_0_1355781849

And it has all its pieces:

valinuxl:boinc[~/BOINC/projects/malariacontrol.net]$ file openMalariaA_6.58_i686-pc-linux-gnu
openMalariaA_6.58_i686-pc-linux-gnu: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.6.8, dynamically linked (uses shared libs), stripped
valinuxl:boinc[~/BOINC/projects/malariacontrol.net]$ ldd openMalariaA_6.58_i686-pc-linux-gnu
linux-gate.so.1 => (0x007df000)
libpthread.so.0 => /lib/libpthread.so.0 (0x00c5c000)
libm.so.6 => /lib/libm.so.6 (0x00c15000)
libc.so.6 => /lib/libc.so.6 (0x00ab8000)
/lib/ld-linux.so.2 (0x00a99000)

And I think it is where it is supposed to be:

valinuxl:boinc[~/BOINC/projects/malariacontrol.net]$ locate openMalariaA_6.58_i686-pc-linux-gnu
/home/boinc/BOINC/projects/malariacontrol.net/openMalariaA_6.58_i686-pc-linux-gnu

Where do I go from here? The SETI@home and World Community Grid applications are working OK.

____________

Michael Grossman
Joined: Jul 22 09
Posts: 2
Credit: 7,712
RAC: 0

Try running it from the command line directly:

./openMalariaA_6.58_i686-pc-linux-gnu


It works OK on my Linux x86_64 system without any special configuration.
I installed it from the graphical BOINC Manager.

What complains that the application is not there? BOINC manager?

Here's what's in my BOINC folder:

~/BOINC/projects/malariacontrol.net $ ls -l -h -G
total 12M
-rw-r--r-- 1 michael 20K Dec 17 10:53 autoRegressionParameters.csv
-rw-r--r-- 1 michael 38K Dec 17 10:52 densities.csv
-rwxr-xr-x 1 michael 12M Dec 17 10:53 openMalariaA_6.58_x86_64-pc-linux-gnu
-rw-r--r-- 1 michael 139K Dec 17 10:53 scenario_29.xsd
-rw-r--r-- 1 michael 29K Dec 20 16:23 wu_2899_402_943801_0_1356038234
-rw-r--r-- 1 michael 45K Dec 20 21:00 wu_2903_173_944059_0_1356054515


btw, use the Code button above to make your posting easier to read.

Jean-David Beyer
Joined: Jan 5 07
Posts: 18
Credit: 168,677
RAC: 80

Task 149302533

Name wu_3152_517_239622_0_1356075191_1
Workunit 86541019
Created 21 Dec 2012 8:10:44 UTC
Sent 21 Dec 2012 8:17:07 UTC
Received 21 Dec 2012 8:52:25 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 73 (0x49)
Computer ID 613682
Report deadline 25 Dec 2012 23:23:47 UTC
Run time 4.16
CPU time 2.07
Validate state Invalid
Credit 0.00
Application version openMalaria: A simulator of malaria epidemology and control (Branch B) v6.58
Stderr output

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
process exited with code 73 (0x49, -183)
</message>
<stderr_txt>
Exception: numNewInfections: NaN
in ../../projects/malariacontrol.net/openMalariaB_6.58_i686-pc-linux-gnu:
+0x330 OM::Host::InfectionIncidenceModel::numNewInfections(OM::Host::Human const&, double)
+0x55 OM::Host::Human::updateInfection(OM::Transmission::TransmissionModel*, double)
+0xc5 OM::Host::Human::update(OM::Population const&, OM::Transmission::TransmissionModel*, bool)
+0x12e OM::Population::update1()
+0x518 OM::Simulation::start()
+0x186 main()
in /lib/libc.so.6:
+0xdc __libc_start_main()
../../projects/malariacontrol.net/openMalariaB_6.58_i686-pc-linux-gnu [0x84c16f1]
OpenMalaria: No such file or directory
03:50:04 (15553): called boinc_finish

</stderr_txt>
]]>

Michael Grossman wrote: "What complains that the application is not there? BOINC manager?"

I do not know. This is the entirety of the task description on the web site. I think this is a different work unit, but it has the same symptoms. A good one is running now and has over 6 hours on it. The failed ones all die in a few seconds.
____________

Jean-David Beyer
Joined: Jan 5 07
Posts: 18
Credit: 168,677
RAC: 80

The good one finished and validated. Others still fail in a few seconds.
I have two more in the queue, and I checked with the ldd command that they all have their needed dependencies.
It is acting as though there are a bunch of bad work units. The failures all end with a complaint about NaN, which I assume means the program got Not a Number where a number was expected. A good one ends like this:

Task 149304755

Name wu_1068_417_944371_0_1356079396_0
Workunit 86544599
Created 21 Dec 2012 8:43:24 UTC
Sent 21 Dec 2012 8:52:25 UTC
Received 23 Dec 2012 8:56:09 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 613682
Report deadline 25 Dec 2012 23:59:05 UTC
Run time 74,503.21
CPU time 70,583.00
Validate state Valid
Credit 31.69
Application version openMalaria: A simulator of malaria epidemology and control (Branch A) v6.58
Stderr output

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<stderr_txt>
sim end
T/A: 409465126/155890321
02:43:12 (16620): called boinc_finish

</stderr_txt>
]]>

____________

Jean-David Beyer
Joined: Jan 5 07
Posts: 18
Credit: 168,677
RAC: 80

If I knew I was just getting bad work units, I would continue, but only two have run and the rest (quite a few) all failed with exit code 73 (I think): NaN.

Is it likely that anyone will enlighten me, or should I detach this machine from malariacontrol?

The machine seems all right. It has no trouble with SETI@home, World Community Grid, and Rosetta@home. I ran 24 hours of memtest86 on it before installing CentOS 5 on it.

I had been running CentOS 4 for quite a few years before that and it worked well, but got less and less able to run some BOINC projects because its libraries were getting too old. The machine is too small and slow to upgrade to CentOS 6.
____________

mikey
Joined: Mar 23 07
Posts: 4382
Credit: 5,361,193
RAC: 1,084

I am NOT a Linux guy, but after a quick search I found this, which MAY help:
http://forums.fedoraforum.org/showthread.php?s=e6b77f1a95df605875066338b1f9553b&p=1576639#post1576639

I think you are finding that there are just not many Linux folks who use the forums very much here.

Jean-David Beyer
Joined: Jan 5 07
Posts: 18
Credit: 168,677
RAC: 80

I do not understand that link at all.

It seems to have to do with getting the boinc client to start under Fedora 17.
I am running it on CentOS 5 (also Red Hat Enterprise Linux 6 on another machine), and it starts just fine, and it runs other BOINC applications just fine.

It is just that when it runs malariacontrol.net applications, I estimate 90% to 95% of them fail within a few seconds with NaN; the longest a failure has run is just under 8 seconds. Two have run correctly for many hours and their results validated. One had the following run time: 74,503.21
____________

mikey
Joined: Mar 23 07
Posts: 4382
Credit: 5,361,193
RAC: 1,084

Sorry about that, I said I wasn't a Linux guy!
I will end by saying MERRY CHRISTMAS TO ALL!!

Thyme Lawn
Joined: Jun 20 06
Posts: 181
Credit: 1,233,724
RAC: 1,389

Looking at the input files for the pending MC tasks on my systems, the wu_1068_* and wu_3152_* tasks have a totally different structure, so they probably follow completely different paths in the code. That could mean, for example, that the failed models tried to use instructions which aren't supported by the Pentium III (e.g. SSE2 or SSE3).
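
One way to test the instruction-set guess on the P3 itself is to look at the CPU flags the kernel reports. The following is only a sketch, assuming a Linux /proc/cpuinfo with a "flags" line; the helper name has_cpu_flag is made up for illustration. A missing flag would merely be consistent with the theory; it would not prove that the application actually uses those instructions.

```shell
#!/bin/sh
# has_cpu_flag FLAG [CPUINFO_FILE]
# Hypothetical helper: prints "yes" if FLAG appears as a whole word on a
# "flags" line of the cpuinfo file (default /proc/cpuinfo), otherwise "no".
has_cpu_flag() {
    flag=$1
    file=${2:-/proc/cpuinfo}
    if grep '^flags' "$file" 2>/dev/null | grep -qw -- "$flag"; then
        echo yes
    else
        echo no
    fi
}

# Linux reports SSE3 under the name "pni".
for f in sse sse2 pni; do
    printf '%s: %s\n' "$f" "$(has_cpu_flag "$f")"
done
```

A Pentium III would be expected to report sse but neither sse2 nor pni.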

The task names are probably significant, but as all tasks have been purged from the database for your P3 you would have to generate the list manually. You'll find the successful ones in ~/BOINC/job_log_www.malariacontrol.net.txt, but the failed ones would have to be extracted from ~/BOINC/stdoutdae.txt and ~/BOINC/stdoutdae.old.
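
Building that list by hand could look roughly like the sketch below. It rests on several assumptions that have not been checked against the real log formats: that task names always look like wu_ followed by underscore-separated numbers (as in the names quoted in this thread), that the first six underscore-separated fields identify the work unit, and that any work unit mentioned in the client logs but absent from the job log did not finish successfully (which would also catch tasks still in progress). The function names are invented for illustration.

```shell
#!/bin/sh
# task_names FILE...
# Print the unique work unit names (first six "_" fields of anything that
# looks like wu_<numbers>) found in the given files. Missing files are skipped.
task_names() {
    cat -- "$@" 2>/dev/null | grep -oE 'wu_[0-9]+(_[0-9]+)+' | cut -d_ -f1-6 | sort -u
}

# not_in_job_log JOB_LOG CLIENT_LOG...
# Print work units seen in the client logs that never appear in the job log.
not_in_job_log() {
    job_log=$1
    shift
    ok=$(mktemp)
    seen=$(mktemp)
    task_names "$job_log" > "$ok"
    task_names "$@" > "$seen"
    comm -13 "$ok" "$seen"      # lines only in the second (sorted) list
    rm -f "$ok" "$seen"
}

# count_by_prefix
# Read names on stdin and count them by their leading wu_NNNN group.
count_by_prefix() {
    cut -d_ -f1,2 | sort | uniq -c
}

# Possible use, run from the BOINC directory:
#   not_in_job_log job_log_www.malariacontrol.net.txt stdoutdae.txt stdoutdae.old | count_by_prefix
```

If the failures cluster under one or two wu_NNNN prefixes while the successes fall under others, that would support the idea that the task names are significant.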
____________
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer


Copyright © 2013 africa@home