openMalaria test version v6.65

Thyme Lawn
Joined: Jun 20 06
Posts: 181
Credit: 1,233,724
RAC: 1,389

My C2D T7300 Vista system recently received a wuVFNEW_ task and 7 * wuVFNEWcov_ tasks. They were all reissues after very fast failures on other systems.

wuVFNEW_11184_1347370135_1 failed with the same error as the _0 task, namely exit status 68 with stderr:

ITN.description.anophelesParams.twoStageDeterrency.attacking: bounds not met: holeFactor≥0 holeFactor+1×(insecticideFactor+interactionFactor)≥0
Error in scenario XML file: ITN.description.anophelesParams.postprandialKillingFactor: expected baseFactor to be in range [0,1]
OpenMalaria: Domain error


The wuVFNEWcov_ tasks, however, seem to be running fine, with 2 completions so far and another 2 in progress. The tasks which failed prior to them being issued to me finished with exit status 66 and stderr:

XSD error: instance document parsing failed
:221:31 error: no declaration found for element 'twoStageDeterrency'
:227:27 error: element 'twoStageDeterrency' is not allowed for content model '(deterrency,preprandialKillingEffect,postprandialKillingEffect)'
:230:31 error: no declaration found for element 'twoStageDeterrency'
:236:27 error: element 'twoStageDeterrency' is not allowed for content model '(deterrency,preprandialKillingEffect,postprandialKillingEffect)'
17:40:47 (1304): called boinc_finish

I've compared the files currently in the project directory with those in a month-old backup, and it looks like there are 2 different versions of scenario_30.xsd. My backup has a 168,703 byte file with no "twoStageDeterrency" element, but the live directory has a 175,997 byte file with that element. This suggests that the failing tasks are using the older version of that file.

Edit: revisiting the first error, scenario_30.xsd doesn't contain the element "postprandialKillingFactor" but it does contain "postprandialKillingEffect".
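
For anyone who wants to repeat the comparison without diffing the files by eye, here is a minimal sketch that lists the element names a schema declares and shows what one version has that the other lacks. The two inline schemas are tiny hypothetical stand-ins, not the real scenario_30.xsd, and declared_element_names is just a helper name chosen for the example.

```python
import xml.etree.ElementTree as ET

# Namespace prefix used by ElementTree for XML Schema nodes.
XS = "{http://www.w3.org/2001/XMLSchema}"

def declared_element_names(xsd_text):
    """Return the set of names declared by xs:element nodes in a schema."""
    root = ET.fromstring(xsd_text)
    return {el.get("name") for el in root.iter(XS + "element") if el.get("name")}

# Hypothetical stand-ins for the old and new schema files.
OLD_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="deterrency"/>
  <xs:element name="postprandialKillingEffect"/>
</xs:schema>"""

NEW_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="deterrency"/>
  <xs:element name="postprandialKillingEffect"/>
  <xs:element name="twoStageDeterrency"/>
</xs:schema>"""

# Names present only in the newer schema.
added = declared_element_names(NEW_XSD) - declared_element_names(OLD_XSD)
print(sorted(added))
```

To run it against real files, read each file's text and pass it to declared_element_names in place of the inline strings.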
____________
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer

michaelT
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: Jul 20 10
Posts: 47
Credit: 16,359
RAC: 0

arf ...

1. wuVFNEW_*: the XML attribute baseFactor was set to a value < 0, while a value within the range [0,1] was expected. The XML file was validated against the XSD, but openMalaria performs additional checks on value ranges, which is why this error was not detected earlier. In addition, before submitting a new batch we test it on our own computers and then send a sample to BOINC before sending the whole batch. What you received was the sample.
I'll check with the scientist who created the experiment what happened.

2. wuVFNEWcov_*: maybe a file cache problem. We updated openMalariaBeta to a new version which is backward compatible, and replaced scenario_30.xsd with a new version on our server. So new workunits are sent with the new XSD file and not the old one. However, it seems that the file was not replaced on all clients. I'll check whether it's a problem with the server cache or the BOINC client cache.
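
To illustrate point 1, here is a rough sketch of the kind of range check that runs after schema validation has already passed. It is not openMalaria's actual code (which is C++); check_unit_interval and the value -0.1 are made up for the example, and only the message wording is copied from the stderr quoted above.

```python
def check_unit_interval(path, name, value):
    """Raise ValueError if value lies outside the closed interval [0, 1]."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{path}: expected {name} to be in range [0,1]")
    return value

try:
    check_unit_interval(
        "ITN.description.anophelesParams.postprandialKillingFactor",
        "baseFactor",
        -0.1,  # hypothetical negative value, similar in spirit to the failing batch
    )
except ValueError as err:
    print("Error in scenario XML file:", err)
```

A schema can declare that baseFactor is a number, but unless the schema itself restricts the range, only a check like this one catches an out-of-range value.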

Thyme Lawn
Joined: Jun 20 06
Posts: 181
Credit: 1,233,724
RAC: 1,389

Thanks Michael.

Unfortunately I've just had 3 tasks fail very quickly on my C2Q Q6600 XP system with exit status 66 and stderr:

XSD error: instance document parsing failed
scenario_30.xsd:1:65 error: non-whitespace characters are not allowed in schema declarations other than appinfo and documentation
scenario_30.xsd:1:12 error: root element name of XML Schema document must be 'schema'
scenario_30.xsd:1:12 error: element 'soft_link' must be from the XML Schema namespace
:2:200 error: no declaration found for element 'scenario'
:2:200 error: attribute 'wuID' is not declared for element 'scenario'
:2:200 error: attribute 'analysisNo' is not declared for element 'scenario'
:2:200 error: attribute 'name' is not declared for element 'scenario'
:2:200 error: attribute 'schemaVersion' is not declared for element 'scenario'
13:53:36 (362172): called boinc_finish


Tasks were run with the new scenario_30.xsd file and the workunits were 78845810 (wu_3253_317_23378_0_1347448986), 78845854 (wu_3210_30_23379_0_1347448989) and 78848847 (wu_3210_49_23380_0_1347451925). So far 3 tasks have failed in the same way for the first 2 WUs and 2 for the last one.

My Q6600 has completed >50 wu_3210_* tasks and 1 wu_3253_* task with this combination of application and scenario file.

The reference to 'soft_link' on the 4th stderr line suggests that scenario_30.xsd was being soft linked (rather than copied) to the slot directory. Sure enough, I just checked my client_state.xml file, and a running and a queued task have the following (note the missing <copy_file/> for the queued task):

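
<workunit>
<name>wu_3210_508_23047_0_1347129610</name>
<file_ref>
<file_name>scenario_30.xsd</file_name>
<open_name>scenario_30.xsd</open_name>
<copy_file/>
</file_ref>
</workunit>

<workunit>
<name>wu_3220_517_23377_0_1347447189</name>
<file_ref>
<file_name>scenario_30.xsd</file_name>
<open_name>scenario_30.xsd</open_name>
</file_ref>
</workunit>
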
The queued task is the _3 for workunit 78843971 and the previous 3 have the same error.
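
If you want to check a slot directory yourself, the following sketch tells a soft-link stub apart from a real schema file by looking at the file's text. It assumes the stub is a single soft_link element wrapping a relative path, which is what the stderr above implies; the path in the example is hypothetical and soft_link_target is just a name chosen here.

```python
import re

# A stub is assumed to contain nothing but one <soft_link>...</soft_link> element.
SOFT_LINK_RE = re.compile(r"^\s*<soft_link>(.*?)</soft_link>\s*$", re.DOTALL)

def soft_link_target(file_text):
    """Return the linked path if the text is a soft-link stub, else None."""
    match = SOFT_LINK_RE.match(file_text)
    return match.group(1) if match else None

# Hypothetical contents of scenario_30.xsd as found in a slot directory.
stub = "<soft_link>../../projects/www.malariacontrol.net/scenario_30.xsd</soft_link>\n"
real = '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"/>'

print(soft_link_target(stub))  # the path the stub points at
print(soft_link_target(real))  # None: this is an actual schema
```

An application that opens the stub directly, instead of resolving it first, would hand the XML parser a document whose root element is soft_link, which matches the "root element name of XML Schema document must be 'schema'" error.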

Shurado
Joined: Dec 23 11
Posts: 4
Credit: 191,064
RAC: 501

I am experiencing the same problem. 14 tasks failed immediately with exit code 66.

Thyme Lawn
Joined: Jun 20 06
Posts: 181
Credit: 1,233,724
RAC: 1,389

I forgot to mention that I've disabled the test application in project preferences until the project team have fixed the workunit problem.

michaelT
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: Jul 20 10
Posts: 47
Credit: 16,359
RAC: 0

We removed the <copy_file/> from the XML doc to see whether that would replace the old scenario_30.xsd file with the new one on the client side, but it was a bad idea :( ... we set it back as it was before.

The problem is that we have a new version of scenario_30.xsd which replaces an older version. In theory the new one should replace the old one each time you download a new workunit, because the client creates a slot directory into which the workunit's files are downloaded. The old file is only kept if you specify "<sticky/>" in the XML doc.

michaelT
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: Jul 20 10
Posts: 47
Credit: 16,359
RAC: 0

We stopped the generation of new workunits for beta and we will wait until all the workunits are back.

What happened is this:
- We replaced the old scenario_30.xsd file with the new one and launched a sample set of a batch to test.

- The clients which didn't have workunits from beta downloaded the new version.

- But some clients were still running workunits from beta, so they still had the old scenario_30.xsd file in the project directory. We expected that when such a client got new workunits, scenario_30.xsd would be replaced with the new one because the checksums of the old and new files differ, but that was not the case... The old files stayed, and that's where the mess started.

So we have to wait until all the workunits from beta are back, so that the scenario_30.xsd file is deleted from the project folder ("<sticky/>" is not in the XML doc, so the client won't keep it). Then everything will be fixed and we can start generating new workunits again.
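
The behaviour we expected can be sketched as a checksum comparison between the cached file and the one advertised by the server. This is only an illustration of that expectation, assuming an MD5 digest; it is not what the BOINC client does (as this thread shows, the cached file was kept), and needs_refresh is a hypothetical helper.

```python
import hashlib

def md5_hex(data):
    """Hex MD5 digest of a bytes object."""
    return hashlib.md5(data).hexdigest()

def needs_refresh(cached_bytes, server_md5):
    """True when the cached copy does not match the checksum sent by the server."""
    return md5_hex(cached_bytes) != server_md5

# Hypothetical file contents standing in for the two schema versions.
old_xsd = b"<schema>old</schema>"
new_xsd = b"<schema>old plus twoStageDeterrency</schema>"

print(needs_refresh(old_xsd, md5_hex(new_xsd)))  # stale copy, digests differ
print(needs_refresh(new_xsd, md5_hex(new_xsd)))  # up-to-date copy, digests match
```

A check like this only helps if something actually runs it when a file of the same name is already present, which is why giving a changed file a new name is the safer route.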

m
Joined: May 29 08
Posts: 4
Credit: 126,517
RAC: 113

Thanks, Michael, for the explanation. I'll restore the test app. and await new work.

jp.

Neil
Joined: Dec 30 09
Posts: 20
Credit: 1,815,035
RAC: 1,484


Crunching is grinding to a halt.

9/15/2012 10:23:11 AM | malariacontrol.net | Restarting task wu_3210_318_23272_0_1347324065_1 using openMalariaBeta version 665 in slot 0
9/15/2012 10:23:11 AM | malariacontrol.net | Sending scheduler request: To fetch work.
9/15/2012 10:23:11 AM | malariacontrol.net | Reporting 12 completed tasks, requesting new tasks for CPU
9/15/2012 10:23:22 AM | malariacontrol.net | [error] Can't parse workunit in scheduler reply: unexpected XML tag or syntax
9/15/2012 10:23:22 AM | malariacontrol.net | [error] No close tag in scheduler reply
9/15/2012 10:24:48 AM | malariacontrol.net | Fetching scheduler list
9/15/2012 10:24:50 AM | malariacontrol.net | Master file download succeeded
9/15/2012 10:24:55 AM | malariacontrol.net | Sending scheduler request: To fetch work.

One option would be to sit and wait until things straighten out, but I don't know if malariacontrol will ever be able to upload completed work units from my computer that have already failed to upload.

The completed work units are stuck in my computer -- Even if I try to abort them, they won't go away.

To get rid of stuck work units, my first effort was to uninstall and re-install Boinc, but the damaged work units re-appeared. So, I uninstalled Boinc again, and deleted the Boinc directories manually. When I re-installed, I had lost all my project data (i.e. Recent Average Credits were set to Zero).

My other 3 computers are filling fast with stuck work units, and I don't know what to do with them. Sit on my hands? Clearing out the stuck work units is obviously NOT an option. What can I do?

I wish new formattings of work units would be checked for viability before they're distributed.

walton748
Joined: Jan 9 10
Posts: 8
Credit: 1,576,555
RAC: 3,697

Hi there,

I can confirm this. I posted about this on Thursday in the Macintosh section of this forum, because at the time I considered it a Mac-specific problem: two Macs were showing the symptom while a Windows machine was still successfully communicating with the Malariacontrol servers. Since today, between 15:32 and 15:51 UTC, my Windows machine has been suffering from the same problem.

I guess it's about time to take it a bit seriously.

Cheers everybody

walton748
Joined: Jan 9 10
Posts: 8
Credit: 1,576,555
RAC: 3,697

Neil, one more question: Do your computers really fill? To me it looks more like my BOINC instances just don't understand the server's answer to a request and actually have no knowledge of workunits they get assigned by the servers. The servers in turn do not assign more than the usual quantity in total as they always do, and stop there, so that I end up with a couple of workunits assigned but not actually worked upon (as I can check using my account on their website).

regards

Neil
Joined: Dec 30 09
Posts: 20
Credit: 1,815,035
RAC: 1,484

> Do your computers really fill?

Could be I misspoke. Now, I don't know what I meant by "filled."

Computer 1: It's got one work unit that I tried to abort because it errored. Clicking "Update," but it won't go away. No more work units coming in.

Computers 2 thru N: Plenty of WUs Ready to Report, and a lesser bunch of Computation Errors that ran for 2 seconds. No more work units being crunched, no more coming in, and none going out, cpu chips are so cold.

All computers are ready to run -- Boinc is not suspended or anything, but nothing's happening. All computers repeat the error messages (Can't parse / No close tag) a dozen times. "Communication Deferred" for hours, after which I guess they will try again; best luck.

Thanks for following my ramblings. I hope I cleaned up my ambiguities.

-neil-

mikey
Joined: Mar 23 07
Posts: 4382
Credit: 5,361,193
RAC: 1,084

> > Do your computers really fill?
>
> Could be I misspoke. Now, I don't know what I meant by "filled."
>
> Computer 1: It's got one work unit that I tried to abort because it errored. Clicking "Update," but it won't go away. No more work units coming in.
>
> Computers 2 thru N: Plenty of WUs Ready to Report, and a lesser bunch of Computation Errors that ran for 2 seconds. No more work units being crunched, no more coming in, and none going out, cpu chips are so cold.
>
> All computers are ready to run -- Boinc is not suspended or anything, but nothing's happening. All computers repeat the error messages (Can't parse / No close tag) a dozen times. "Communication Deferred" for hours, after which I guess they will try again; best luck.
>
> Thanks for following my ramblings. I hope I cleaned up my ambiguities.
>
> -neil-


The wait seems to be over:
9/16/2012 6:19:35 AM | malariacontrol.net | Scheduler request completed: got 2 new tasks

Thyme Lawn
Joined: Jun 20 06
Posts: 181
Credit: 1,233,724
RAC: 1,389

> All computers are ready to run -- Boinc is not suspended or anything, but nothing's happening. All computers repeat the error messages (Can't parse / No close tag) a dozen times. "Communication Deferred" for hours, after which I guess they will try again; best luck.

From the limited information I can see, everything looks normal on the server side, Neil (the task wu_3210_318_23272_0_1347324065_1 in the messages you posted was reported as successfully completed at 15 Sep 2012 17:59:39 UTC). If you are still getting scheduler reply corruptions, BOINC Manager should have that task in the "Ready to report" state.

The computer it was run on (590902) received a batch of work at 16 Sep 2012 9:18:30 UTC. Again, if you're still getting scheduler reply corruption it's likely that those tasks (e.g. wu_3210_416_23379_0_1347448991_2) won't be listed in BOINC Manager.

Digging a bit deeper, you posted:

9/15/2012 10:23:11 AM | malariacontrol.net | Reporting 12 completed tasks, requesting new tasks for CPU

Assuming your timezone is UTC-4 that would tally with tasks timed as being reported at 15 Sep 2012 14:23:22 UTC on the server, but there were only 5 tasks reported at that time. Counting back to get the 12 reported tasks would put your first request failure at 15 Sep 2012 8:21:34 UTC (4:21:34 local time) and your last successful scheduler requests at 14 Sep 2012 12:45:13 UTC (completion report at 8:45:13 local time) and 14 Sep 2012 16:17:03 UTC (work fetch at 12:17:03 local time). That's pointing towards a significant change at your end sometime between 12:17:03 on 14th and 04:21:34 on 15th.
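
For anyone wanting to line up their own event log with server times, the conversion above can be sketched like this. The UTC-4 offset is the same assumption made in this post, and log_time_to_utc is just a name picked for the example.

```python
from datetime import datetime, timedelta, timezone

# Assumed local offset (UTC-4), as guessed in the post above.
LOCAL = timezone(timedelta(hours=-4))

def log_time_to_utc(stamp):
    """Convert a BOINC Manager style timestamp such as '9/15/2012 10:23:22 AM' to UTC."""
    local = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p").replace(tzinfo=LOCAL)
    return local.astimezone(timezone.utc)

print(log_time_to_utc("9/15/2012 10:23:22 AM"))
```

With that offset, the 10:23:22 AM log entry corresponds to 14:23:22 UTC on the same day, which is the server-side report time quoted above.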

The scheduler reply is in the file sched_reply_www.malariacontrol.net.xml which is stored in the BOINC data directory.

The file should start with a <scheduler_reply> tag and end with a </scheduler_reply> tag. You posted the error messages:

9/15/2012 10:23:22 AM | malariacontrol.net | [error] Can't parse workunit in scheduler reply: unexpected XML tag or syntax
9/15/2012 10:23:22 AM | malariacontrol.net | [error] No close tag in scheduler reply

That indicates the reply contained an opening tag with no matching closing tag, suggesting that something is causing the replies to be truncated. What size is that file on your system?
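
A quick way to look for that kind of damage is to count opening and closing tags in the saved reply. The sketch below does a plain text count; the tag names scheduler_reply and workunit are assumptions about the file layout, the sample strings are invented, and tags that carry attributes would not be counted.

```python
def unbalanced_tags(reply_text, tag_names):
    """Return {tag: (opens, closes)} for every tag whose open/close counts differ."""
    result = {}
    for name in tag_names:
        opens = reply_text.count(f"<{name}>")
        closes = reply_text.count(f"</{name}>")
        if opens != closes:
            result[name] = (opens, closes)
    return result

# Invented examples: a reply cut off mid-workunit, and the same reply completed.
truncated = "<scheduler_reply>\n<workunit>\n<name>wu_example</name>\n"
complete = truncated + "</workunit>\n</scheduler_reply>\n"

print(unbalanced_tags(truncated, ["scheduler_reply", "workunit"]))
print(unbalanced_tags(complete, ["scheduler_reply", "workunit"]))
```

To use it on a real reply, read the text of sched_reply_www.malariacontrol.net.xml from the BOINC data directory and pass it in place of the sample strings. An empty result does not prove the reply is well formed, but a non-empty one points straight at the damaged block.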

For reference, here's the workunit block from my most recent scheduler reply:


28000000000000.000000
4000000000000000.000000
330000000.000000
450000000.000000
wu_3001_316_850557_0_1347797598
openMalariaA

wu_3001_316_850557_0_1347797598
scenario.xml


densities.csv
densities.csv


scenario_29.xsd
scenario_29.xsd



autoRegressionParameters.csv
autoRegressionParameters.csv



--compress-checkpoints=1



walton748
Joined: Jan 9 10
Posts: 8
Credit: 1,576,555
RAC: 3,697

Thyme,

do you want me to send you one of my failed request replies separately? It is quite long, looks more complicated than the example you gave and I would not know which excerpt to post.

Here is one more observation: there was one request that got through after all of the test WUs for that particular machine (ID: 163009) timed out. This one only contained regular field WUs. The other computers continue to fail communication. Maybe that indicates they fail as long as the reply contains information on assigned test units.

Regards

walton748
Joined: Jan 9 10
Posts: 8
Credit: 1,576,555
RAC: 3,697

I withdraw. The very same machine just downloaded test units, while the others keep failing. I don't see any pattern and give up. Everybody have a good Sunday, or whatever remains of that.

Regards

Thyme Lawn
Joined: Jun 20 06
Posts: 181
Credit: 1,233,724
RAC: 1,389

> Here is one more observation: there was one request that got through after all of the test WUs for that particular machine (ID: 163009) timed out. This one only contained regular field WUs. The other computers continue to fail communication. Maybe that indicates they fail as long as the reply contains information on assigned test units.

I've received and reported work for the branch A, branch B and test applications both yesterday and today, so it's definitely not a general problem.

Thyme Lawn
Joined: Jun 20 06
Posts: 181
Credit: 1,233,724
RAC: 1,389

walton748 sent me the sched_reply_www.malariacontrol.net.xml file from a system which is generating the parse error and it highlighted a problem with the format of the block passed by the server for some of the test workunits (e.g. WU 78917406). The project has the option to resend lost tasks enabled, so affected hosts will continue to get parse errors until the task which is causing the problem times out.

I've passed the details on to Michael.

Again, I'd advise disabling the test application until the problem has been fixed.

mikey
Joined: Mar 23 07
Posts: 4382
Credit: 5,361,193
RAC: 1,084

> > Here is one more observation: there was one request that got through after all of the test WUs for that particular machine (ID: 163009) timed out. This one only contained regular field WUs. The other computers continue to fail communication. Maybe that indicates they fail as long as the reply contains information on assigned test units.
>
> I've received and reported work for the branch A, branch B and test applications both yesterday and today, so it's definitely not a general problem.


I am sorry I should have clarified I do NOT crunch the Test unit, just the A and B units. Sorry for any confusion!

michaelT
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: Jul 20 10
Posts: 47
Credit: 16,359
RAC: 0

Dammit ... I think I need a holiday or new glasses ...
By mistake (for Beta), while changing the XML back to the original one, I removed the XML tag in the XML doc. XML fixed. ...

Sorry for all the mess. :( :( :( :( :( :(
____________
Michael Tarantino
Swiss Tropical and Public Health Institute
http://www.swisstph.ch

michaelT
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: Jul 20 10
Posts: 47
Credit: 16,359
RAC: 0

Some news from the battlefront: all the beta workunits have been sacrificed (cancelled). Some of them are still fighting with various errors ... and it will take a little while until the whole stack has been cleaned up and all the workunits are buried ...

So we will still see some errors about scenario_30.xsd for a couple of them. But we will soon come back with brand new "working" workunits.

Neil
Joined: Dec 30 09
Posts: 20
Credit: 1,815,035
RAC: 1,484


I was just sitting there watching Star Trek On Demand, when I suddenly noticed my uProcessors were running full bore. All of my uProcessors.

Thanks to MichaelT and everyone.

My next question was going to be how to get rid of all those work units that were stuck inside my computers, but they've all cleared out -- Ready to Reports, Computing Errors, and Aborteds.

Somehow, I feel better when my Malariacontrol.net is running smoothly. Guess I should mention that to my shrink.

Best luck / Don't crunch too hard.
-n-

Thyme Lawn
Joined: Jun 20 06
Posts: 181
Credit: 1,233,724
RAC: 1,389

wuVFNEWcov_11_1347965467_0 was reported as a successful completion last night. This was the first wuVFNEWcov_ task I've run since the first post in the thread, but the stderr output does contain the following message:

ITN.description.anophelesParams.twoStageDeterrency.attacking: bounds not met: holeFactor≥0


michaelT
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: Jul 20 10
Posts: 47
Credit: 16,359
RAC: 0

Great it worked :)

As for the message, no worries: it's a warning caused by the experiment configuration we set up, and we're expecting it. It will be taken into account during the analysis of the results.

Thyme Lawn
Joined: Jun 20 06
Posts: 181
Credit: 1,233,724
RAC: 1,389

> This was the first wuVFNEWcov_ task I've run since the first post in the thread, but the stderr output does contain the following message:
>
> ITN.description.anophelesParams.twoStageDeterrency.attacking: bounds not met: holeFactor≥0

And my first wuVFNEW_ task since the first post (wuVFNEW_11159_1348502747_0) has also been successfully completed (with the same stderr message as wuVFNEWcov_ tasks).

Thyme Lawn
Joined: Jun 20 06
Posts: 181
Credit: 1,233,724
RAC: 1,389

My Q6600 has reported 4 successful wuVFNEW2_* tasks so far (e.g. wuVFNEW2_223_1349105777_0).

They all have the same "bounds not met: holeFactor≥0" message and the CPU time was very variable. The slowest (6:54:24) took 4 times longer than the quickest (1:43:28).

Thyme Lawn
Joined: Jun 20 06
Posts: 181
Credit: 1,233,724
RAC: 1,389

The first wucaseMgtSA_* task run by my Q6600 (wucaseMgtSA_1338_1350912446_0) has been reported as a successful completion with no unexpected messages.

Thyme Lawn
Joined: Jun 20 06
Posts: 181
Credit: 1,233,724
RAC: 1,389

My Q6600 reported a large batch of wu59_* tasks overnight, all successful completions with no unexpected messages.

Thyme Lawn
Joined: Jun 20 06
Posts: 181
Credit: 1,233,724
RAC: 1,389

> My Q6600 reported a large batch of wu59_* tasks overnight, all successful completions with no unexpected messages.

It's now started reporting wu64_* tasks, again successful completions with no unexpected messages.

Thyme Lawn
Joined: Jun 20 06
Posts: 181
Credit: 1,233,724
RAC: 1,389

wu218_68419_0 was successfully completed but ran a lot faster than any other openMalaria tasks on my Q6600. The task had run time 7.02 seconds and CPU time 3.42 seconds. By way of comparison, the branch A+B tasks completed during the past week have had a CPU time range between 16 and 320 minutes.

Thyme Lawn
Joined: Jun 20 06
Posts: 181
Credit: 1,233,724
RAC: 1,389

I received my first wu69_* tasks yesterday morning. So far 31 of them have been reported as successful completions with no unexpected messages.


Copyright © 2013 africa@home