A second science application for malariacontrol.net


Profile maire
Volunteer moderator
Project administrator
Project developer
Project scientist
Send message
Joined: Nov 7 05
Posts: 438
Credit: 118,258
RAC: 0

We are preparing the second science application to be run using malariacontrol.net. We plan to upload this application next Tuesday (Feb 13th) if our last tests come out ok.
This application was developed at the Swiss Tropical Institute and predicts the spatial distribution of malaria. We'll post a short description of the scientific objectives in this thread early next week. As with all malariacontrol.net applications, the results will be published in peer-reviewed literature.

There are a few things that are worth pointing out: This application makes use of BOINC's wrapper approach for legacy applications. This has a few drawbacks: there is no checkpointing and no feedback from the application about progress. Therefore a workunit will start from the beginning if interrupted. This should not be a big problem because the workunits are relatively short (less than half an hour on most PCs).
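
To illustrate what "no checkpointing" means in practice, here is a rough sketch (not project code). The function name, the session lengths and the 5-minute checkpoint interval are made up for the example; only the restart-from-zero behaviour comes from the description above.

```python
def cpu_minutes_to_finish(work_minutes, sessions, checkpoint_every=None):
    """Total CPU minutes spent until a workunit completes.

    work_minutes     -- uninterrupted run time the workunit needs
    sessions         -- lengths (minutes) of successive uninterrupted runs
    checkpoint_every -- checkpoint interval in minutes, or None for a
                        wrapped application that cannot checkpoint
    Returns None if the workunit never completes within the given sessions.
    """
    saved = 0.0  # progress that survives an interruption
    spent = 0.0  # CPU minutes used so far
    for session in sessions:
        remaining = work_minutes - saved
        if session >= remaining:
            return spent + remaining
        spent += session
        if checkpoint_every:
            # Keep progress up to the last completed checkpoint.
            progress = saved + session
            saved = (progress // checkpoint_every) * checkpoint_every
        # Without checkpoints, saved stays 0: the next run starts over.
    return None


# Hypothetical laptop that is switched off after 15 and then 10 minutes.
without_checkpoints = cpu_minutes_to_finish(20, [15, 10, 30])
with_checkpoints = cpu_minutes_to_finish(20, [15, 10, 30], checkpoint_every=5)
print(without_checkpoints, with_checkpoints)
```

With short workunits the extra cost stays small, which is the point made above, but it grows quickly if a host is rarely on for longer than one workunit takes.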

We have several batches of workunits that we plan to send out over the next few weeks. This first stage will comprise a total of 200'000 workunits. The current simulation model will keep running and send out work in parallel.

Nick
____________
Nicolas Maire
Swiss Tropical and Public Health Institute
http://www.swisstph.ch

Profile The Gas Giant
Avatar
Send message
Joined: Mar 7 06
Posts: 1213
Credit: 3,503,340
RAC: 1,667

Thanks for the update. It's good to hear about what is happening.

Shame about the lack of checkpoints. I run MC on my work laptop which gets turned off twice a day (to/from work) so I will be losing crunch time because of it.

Live long and BOINC.


____________
Paul
(S@H1 8888)

RandyC
Avatar
Send message
Joined: Jun 23 06
Posts: 2695
Credit: 850,101
RAC: 1,184


There are a few things that are worth pointing out: This application makes use of BOINC's wrapper approach for legacy applications. This has a few drawbacks: there is no checkpointing and no feedback from the application about progress. Therefore a workunit will start from the beginning if interrupted. This should not be a big problem because the workunits are relatively short (less than half an hour on most PCs).


I see that the Wrapper process requires BOINC 5.5 or above. Can we assume that earlier clients will not be downloading the new application?

n.b. Guess it's time to do that upgrade to 5.8 now.

Profile Dagorath
Send message
Joined: Jun 26 06
Posts: 68
Credit: 71,310
RAC: 0

Thanks for the warning. I'll be detaching Malaria on Feb 12. No checkpoints is bad enough but no progress indicator as well? What??? Have you taken leave of your senses?!!


____________
--

Robbie Lawrence
Send message
Joined: Jan 4 07
Posts: 12
Credit: 39,680
RAC: 0

Thanks for the warning. I'll be detaching Malaria on Feb 12. No checkpoints is bad enough but no progress indicator as well? What??? Have you taken leave of your senses?!!


I was thinking about this -- World Community Grid has a system where you can choose what projects you wish to have sent to your computer to crunch. Since some people don't want to and can't run this new application, and if you have plans to introduce more applications on top of these two, maybe introduce something like that?

I for one will be more than happy to crunch the new and old applications, since my computers never go off, and it's great to see progress on the project. Good work guys!

Professor Desty Nova
Avatar
Send message
Joined: Mar 7 06
Posts: 3
Credit: 530,383
RAC: 517

Thanks for the warning. I'll be detaching Malaria on Feb 12. No checkpoints is bad enough but no progress indicator as well? What??? Have you taken leave of your senses?!!


Quote from the wrapper page: "A legacy application is one for which an executable is available, but not the source code. Therefore it cannot use the BOINC API and runtime system. However, such applications can be run using BOINC."

So I guess if the original application doesn't have checkpoints, and you can't change the source code, you'll have to leave it like that.
____________


Professor Desty Nova
Researching Karma the Hard Way

Profile Lonely
Avatar
Send message
Joined: Mar 8 06
Posts: 2
Credit: 135,142
RAC: 3

Excuse my ignorance, but is it not possible to 'suspend' WU's prior to exiting the program? If this were possible, it would avoid any loss of work and by doing so, allow those who cannot have their PC's running 24/7 to continue their support without loss of work/time/energy consumed. As indicated, some will leave MC and go elsewhere... it may prove a loss you can ill afford.
____________

Keck_Komputers
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: Nov 10 05
Posts: 29
Credit: 281,138
RAC: 495

Excuse my ignorance, but is it not possible to 'suspend' WU's prior to exiting the program? If this were possible, it would avoid any loss of work and by doing so, allow those who cannot have their PC's running 24/7 to continue their support without loss of work/time/energy consumed. As indicated, some will leave MC and go elsewhere... it may prove a loss you can ill afford.

Sorry, but if the app does not checkpoint, suspending does no good.
____________
BOINC WIKI

BOINCing since 2002/12/8

Profile Ananas
Send message
Joined: Mar 7 06
Posts: 58
Credit: 704,023
RAC: 715

Suspend alone is usually not a problem, as long as "Leave applications in memory while suspended?" is set to "Yes".

The checkpoint is needed as soon as the application ends, i.e. leaves the computer memory.

Profile Dagorath
Send message
Joined: Jun 26 06
Posts: 68
Credit: 71,310
RAC: 0

The lack of checkpoints is not a big concern for me but it will be for many other crunchers.

The big problem is the lack of progress indicator...we will never know if the app has been sitting there spinning its wheels for 2 or 3 days unless we keep a list of WUs and the times they start. We will always be facing the "to abort or not to abort" conundrum and wondering if the WU is just an extra long WU or whether it's stalled.

Nope, not for me!! That's far too much work and bother when the only favors I get in return are worthless credits, worn out hard drives and fans and power bills. This second app needs to be offered as an option. When the admins get that fixed up and get fixed credits working (far more important than a second app) then they can email me and let me know, I might even consider crunching Malaria again.



____________
--

Profile maire
Volunteer moderator
Project administrator
Project developer
Project scientist
Send message
Joined: Nov 7 05
Posts: 438
Credit: 118,258
RAC: 0

Here's a little bit of additional information on the new science application. We hope this will help you to decide if you want to sit out for the estimated 2-3 weeks it will take to go through the various batches.
-The new application will only run on Windows clients. All other hosts will get workunits just like before.
-We have only tested the new application with core clients of version greater than or equal to 5.4 (we don't need min 5.5 because we use a modified wrapper app). Hosts with older core client versions will get workunits just like before.
-All workunits take more or less constant time to complete. On our 2.4 GHz Pentium 4, a workunit takes just under 20 mins to complete. This means that you should be able to tell when a workunit gets stuck (we have not seen this during the tests). Further, we have assigned a fixed credit of 1.8 to each workunit. This is our best estimate to match the credit of the previous application. We can adjust this if necessary after we get results back from the first small batches.

The remainder of this post outlines the scientific objectives of the new app:

Plasmodium falciparum malaria is the world’s most important parasitic disease, and a major cause of morbidity and mortality in Africa. A frequently quoted estimate is that in 1995 in sub-Saharan Africa around 1 million deaths and 220 million clinical episodes were directly attributable to malaria, mostly in children below the age of 5 years. In epidemic-prone malaria areas in Southern Africa, about 2000 deaths and 200,000 clinical episodes occur annually. However these figures are very uncertain, since reliable maps of the distribution of malaria transmission and the numbers of affected individuals are not available for most of the African continent. Reliable maps of the geographical distribution of malaria are urgently needed for accurate estimation of disease burden, for identifying which geographical areas should be prioritised for purposes of resource allocation and for assessing the progress of intervention programs.
The Mapping Malaria Risk in Africa (MARA/ARMA) project was established in 1996 to provide estimates of the distribution of malaria in Africa. It is a collaborative network of key African scientists and institutions with the aim of providing an atlas of malaria for evidence-based and targeted malaria control in Africa. The Swiss Tropical Institute is an active partner of this collaboration. To date, results of well over 10,000 malaria prevalence surveys from published and unpublished sources have been collated into a single, electronically accessible repository representing the most comprehensive database on malaria in Africa.
The current application analyses malaria survey data from the MARA database collected at over 300 locations in West and Central Africa. We fit a Bayesian geostatistical model to relate malaria prevalence to environmental factors such as rainfall, temperature, vegetation and abundance among others which were gathered via remote sensing. Based on this model we predict the malaria risk at locations with no prevalence data over a grid of 200'000 pixels.
Nick
____________
Nicolas Maire
Swiss Tropical and Public Health Institute
http://www.swisstph.ch

renke
Send message
Joined: Jun 26 06
Posts: 2
Credit: 259,413
RAC: 0

The new application will only run on windows clients


*sigh* no one loves penguins...

Chaz
Send message
Joined: Jun 22 06
Posts: 4
Credit: 21,095
RAC: 0

Nick
Thanks for the extra info. I'm sorry to see that one or two people have reacted badly to this new application. I am not a 24/7 cruncher but I assume you will allow a long enough deadline so if the odd work unit ends up being restarted that will not cause a problem.
I think the general reaction of most of the crunchers on this project will be to accept that at the moment this is the only way you can run this particular application and that checkpointing isn't such a major issue on 20min WU's. Please don't be disheartened by the odd negative response. I think you have already found that the majority of us will do our best to support this project, although you may get the odd bit of constructive criticism.
Cheers
Chaz
____________

Profile Contact
Avatar
Send message
Joined: Jun 24 06
Posts: 2
Credit: 87,852
RAC: 0

maire wrote:


-The new application will only run on windows clients. All other hosts will get workunits just like before.

This app - predictor 112, will not run properly on Win9x. A DOS box is opened when app starts.

Unless you see a way to deny Win98 receiving this app until a possible fix, we should suspend Win98 hosts.

Post a news item if you want to test this app with Win98 in the future and we'll resume.

Alternatively, maybe you can use a section of prefs (already in BOINC code, I think) to determine if we want to run new apps, so we can run the older apps only if on Win9x.

For sure, keep up the good work!


____________

Click and enter your name for your BOINC Statistics

Profile maire
Volunteer moderator
Project administrator
Project developer
Project scientist
Send message
Joined: Nov 7 05
Posts: 438
Credit: 118,258
RAC: 0


This app - predictor 112, will not run properly on Win9x. A DOS box is opened when app starts.

Unless you see a way to deny Win98 receiving this app until a possible fix, we should suspend Win98 hosts.

Post a news item if you want to test this app with Win98 in the future and we'll resume.

Alternatively, maybe you can use a section of prefs (already in BOINC code, I think) to determine if we want to run new apps, so we can run the older apps only if on Win9x.

For sure, keep up the good work!

Thanks, we're currently looking at the results from the first batch we sent out. We haven't sent any new workunits in the last 10 hours, but of course a few will be resent.
I'll let you know when and how we'll proceed.
Nick
____________
Nicolas Maire
Swiss Tropical and Public Health Institute
http://www.swisstph.ch

Profile The Gas Giant
Avatar
Send message
Joined: Mar 7 06
Posts: 1213
Credit: 3,503,340
RAC: 1,667

Just completed a few wu's. I like the 1.8 credits for 660 seconds of work!

Profile maire
Volunteer moderator
Project administrator
Project developer
Project scientist
Send message
Joined: Nov 7 05
Posts: 438
Credit: 118,258
RAC: 0

Just completed a few wu's. I like the 1.8 credits for 660 seconds of work!

We've chosen that to match the credit per time we got on a small number of reference computers here that run both applications. We may adjust it a little for future workunits if we see it's too generous. Linux users should not have a disadvantage.
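
As a rough illustration of what a fixed credit means for hosts of different speed (a sketch only, not how the project computes anything): the 1.8 credits, the roughly 20 minutes on the 2.4 GHz reference machine and the 660 seconds reported above come from this thread, while the function name is made up.

```python
def credits_per_hour(credit_per_workunit, seconds_per_workunit):
    """Convert a fixed per-workunit credit into an hourly rate."""
    return credit_per_workunit * 3600.0 / seconds_per_workunit


# Reference machine: just under 20 minutes per workunit (rounded to 20 here).
reference_rate = credits_per_hour(1.8, 20 * 60)
# Faster host reporting 660 seconds per workunit.
reported_rate = credits_per_hour(1.8, 660)

print(f"reference: {reference_rate:.1f} credits/hour")
print(f"reported:  {reported_rate:.1f} credits/hour")
```

That works out to roughly 5.4 versus 9.8 credits per hour: with a fixed credit, a host that finishes a workunit in about half the time earns about twice as much per hour.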
Nick
____________
Nicolas Maire
Swiss Tropical and Public Health Institute
http://www.swisstph.ch

Profile maire
Volunteer moderator
Project administrator
Project developer
Project scientist
Send message
Joined: Nov 7 05
Posts: 438
Credit: 118,258
RAC: 0


This app - predictor 112, will not run properly on Win9x. A DOS box is opened when app starts.

Dear Win98 users, we investigated this problem and have so far not found a workaround. It's caused by Win98 behaving a bit differently when starting a new process. The good news is that the program runs correctly, as long as you don't close that window. We are getting back valid results from Win98 clients. We therefore decided to start sending out workunits again. In the meantime we will keep looking for a workaround.
Nick

____________
Nicolas Maire
Swiss Tropical and Public Health Institute
http://www.swisstph.ch

Profile The Gas Giant
Avatar
Send message
Joined: Mar 7 06
Posts: 1213
Credit: 3,503,340
RAC: 1,667

Just completed a few wu's. I like the 1.8 credits for 660 seconds of work!

We've chosen that to match the credit per time we got on a small number of reference computers here that run both applications. We may adjust it a little for future workunits if we see it's too generous. Linux users should not have a disadvantage.
Nick

LOL...Linux users get screwed for credit with BOINC when compared to windows in any case. Linux benchmarks lower than windows on the same machine so claims lower. I wouldn't adjust it purely for that reason as the problem is not caused by the project, just BOINC. I would be comparing it to other projects. I feel the 1.8 credits is just about right.

I only received 1 of the new wu's on my laptop and it was completed before I had to shut it down. So I didn't waste any cpu cycles.

u.dgl.
Send message
Joined: Mar 8 06
Posts: 26
Credit: 1,117,120
RAC: 728

Hello,

it seems, that this wu is stuck:

15.02.2007 20:39:25|malariacontrol.net beta|Starting task mapwca0000602.txt_1 using mappredictor version 112

It has already been running for 12h 57min at 0.000%
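
For anyone who wants to check this without watching the manager, here is a small sketch that reads a message-log line like the one above and flags a task that has run too long. It assumes the "day.month.year hour:minute:second|project|message" layout shown in this post (other installations may format the date differently), and the one-hour threshold and function names are only illustrative.

```python
from datetime import datetime, timedelta


def parse_start(line):
    """Return (start_time, task_name) for a 'Starting task' line, else None."""
    stamp, _project, message = line.split("|", 2)
    if not message.startswith("Starting task "):
        return None
    task = message.split()[2]
    return datetime.strptime(stamp, "%d.%m.%Y %H:%M:%S"), task


def looks_stuck(start, now, limit=timedelta(hours=1)):
    """True if the task has been running longer than the chosen limit."""
    return now - start > limit


line = ("15.02.2007 20:39:25|malariacontrol.net beta|"
        "Starting task mapwca0000602.txt_1 using mappredictor version 112")
start, task = parse_start(line)
now = start + timedelta(hours=12, minutes=57)  # the run time reported above
print(task, looks_stuck(start, now))
```

Since these workunits are expected to finish in under half an hour on most PCs, any limit comfortably above that should do.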


u.dgl.
____________

u.dgl.
Send message
Joined: Mar 8 06
Posts: 26
Credit: 1,117,120
RAC: 728

Another one was stuck:

16.02.2007 16:11:22|malariacontrol.net beta|Starting task mapwca0000822.txt_1 using mappredictor version 112

u.dgl.

____________

KAMasud
Send message
Joined: Jan 7 07
Posts: 12
Credit: 18,733
RAC: 0


Running Climate on Win9x also opens a DOS box but i get valid results if i don't shut the DOS box :-) i don't mind these short WU's for a change, please keep them coming:-) as it is i have set write to disk every 400 sec in Preferences. :-( and i am from that part of the world where you can't rely on power :-)
Regards
Masud.


This app - predictor 112, will not run properly on Win9x. A DOS box is opened when app starts.

Dear Win98 users, we investigated this problem and have so far not found a workaround. It's caused by Win98 behaving a bit differently when starting a new process. The good news is that the program runs correctly, as long as you don't close that window. We are getting back valid results from Win98 clients. We therefore decided to start sending out workunits again. In the meantime we will keep looking for a workaround.
Nick


____________

B-Roy
Send message
Joined: Jul 14 06
Posts: 9
Credit: 9,858
RAC: 28

Is it actually normal that there is no screensaver for the wu "Prediction of Malaria Prevalence 1.12"?
____________

Hans Sveen
Send message
Joined: Mar 7 06
Posts: 2
Credit: 274,761
RAC: 198

Another one was stuck:

16.02.2007 16:11:22|malariacontrol.net beta|Starting task mapwca0000822.txt_1 using mappredictor version 112

u.dgl.

Hi!
I also got two that were stuck: one ran for almost 9.5 hours (result id: https://malariacontrol.net/result.php?resultid=4423465) and one for nearly 1.5 hours (https://malariacontrol.net/result.php?resultid=4429738). Both were on host id 298.
Also got one valid result https://malariacontrol.net/result.php?resultid=4432377 on host id 250.

A wild guess: The first one is a dual core cpu, the second an ordinary single core cpu. Maybe this will help in solving the issue with "hanging" apps?

With regards,

____________
Hans Sveen
Oslo, Norway

u.dgl.
Send message
Joined: Mar 8 06
Posts: 26
Credit: 1,117,120
RAC: 728

Hi,

it seems that the dual core pc are the problem.

I have seen the same effect as Hans Sveen:

on my dual core pc the wus get stuck, the single core ones have no problem.

Greetings

u.dgl.
____________

Triciabuk
Avatar
Send message
Joined: Jan 19 07
Posts: 1
Credit: 9,440
RAC: 0

Hi,

it seems that the dual core pc are the problem.

I have seen the same effect as Hans Sveen:

on my dual core pc the wus get stuck, the single core ones have no problem.

Greetings

u.dgl.



HI

Can we tell whether a unit is one of the new ones from the WU Number?

I have completed units which have taken about 600, 1200 and 1500 seconds and run perfectly OK on a dual core Pentium D. I don't know for sure that they were second application units though.

Regards

Tricia
____________

wolfsong
Send message
Joined: Feb 11 07
Posts: 4
Credit: 7,048
RAC: 0

Have also had problems with the new work units. Didn't think my PC was dual core but so far none of the 4 mappredict units that have downloaded to my pc have worked, the time just trundles past and no progress is made. Normal units are still working fine.

I cleared all my other project work units down over-night and I even tried resetting the project but the 2 mappredict units I got today are showing the same problem.

Bit frustrating really!

Profile Nightbird
Send message
Joined: Mar 7 06
Posts: 110
Credit: 395,345
RAC: 0

Same problem here with a Barton 3200+ under Millennium (CC 5.4.9). The "new" wu has been running endlessly (since yesterday) but a normal wu is working fine.

edit :
I'm using BoincView and according to this manager, the cpu efficiency = 0.
____________

Do you want to get banned for 31 years and your account & credits deleted at a Boinc project ? Predictor@home is your best choice.

wolfsong
Send message
Joined: Feb 11 07
Posts: 4
Credit: 7,048
RAC: 0

On closer inspection I think I do have dual core (sys info shows 2 cpu's, I didn't know that!! shows how well I know my pc huh!). Guess that ties in with what the others have been saying.

Any chance of a response from the techies please? Do I let the units keep running? Seems a waste of time at the moment. Are they going to fix the problem? Is there anything I can do so that they will work?

Profile Nightbird
Send message
Joined: Mar 7 06
Posts: 110
Credit: 395,345
RAC: 0

On closer inspection I think I do have dual core (sys info shows 2 cpu's, I didn't know that!! shows how well I know my pc huh!). Guess that ties in with what the others have been saying.

Any chance of a response from the techies please? Do I let the units keep running? Seems a waste of time at the moment. Are they going to fix the problem? Is there anything I can do so that they will work?

The wu is suspended on my machine.
It's best to wait until Monday now.

____________

Do you want to get banned for 31 years and your account & credits deleted at a Boinc project ? Predictor@home is your best choice.

AnRM
Send message
Joined: Mar 7 06
Posts: 54
Credit: 2,130,571
RAC: 0

On closer inspection I think I do have dual core (sys info shows 2 cpu's, I didn't know that!! shows how well I know my pc huh!). Guess that ties in with what the others have been saying.

Any chance of a response from the techies please? Do I let the units keep running? Seems a waste of time at the moment. Are they going to fix the problem? Is there anything I can do so that they will work?


Well, I'm not a techie but you don't have a dual core. I have a number of AMD dual core machines and they aren't having any problems with the new WUs. I would suggest that if you are using the BOINC screensaver, you change it to the 'blank' option found on your list of Windows supplied screensavers. This will speed up your processing time and could eliminate your problem. The 2 CPU listing you were looking at is the BOINC default setting and only needs to be changed for multicore CPUs (more than 2), i.e. servers etc. Hope this helps....Rog.
Edit: Wolf, I see you are using BOINC version 5.4.11....you could also try upgrading to the current recommended version, i.e. 5.8.11, that is downloadable from the BOINC home page. FYI, our AMD X2's are configured with BOINC ver5.8.11, and blank screen savers. They seem stable and process these new WUs in about 7-8 minutes....Cheers, Rog.
____________

Robbie Lawrence
Send message
Joined: Jan 4 07
Posts: 12
Credit: 39,680
RAC: 0

This isn't specific to dualcore CPUs. Whether or not it is specific to a certain type of processor, I don't know but I somehow doubt it.
____________

adrianxw
Avatar
Send message
Joined: Mar 8 06
Posts: 145
Credit: 474,763
RAC: 883

This new wu is "stuck". It has so far run 08:38:52 at 100% CPU, so is not the "says it is running but isn't" fault I've seen before. I stop/started BOINC as that normally clears stuck wu's, but won't know if this is fixed until it gets the CPU again, which with the huge negative debt it has run up overnight, may be some time yet. In the "stuck" state, it has monopolised BOINC, none of my other projects have had any CPU overnight. I certainly hope I've not got any of these at my remote site since I only get there occasionally.

Machine is a 2.8GHz Northwood, not a dual core in sight. NT4 SP6a BOINC core 5.8.8, no graphics, leave in memory set.

@WolfSong

You probably are showing 2 processors because your CPU has hyper threading, which makes a chip "appear" to have 2 processors when in fact Intel are using a few tricks to share two tasks on a single CPU. Both of my hyper threaded machines show up that way.

*** EDIT ***

For the sake of research, I have suspended the other projects to let the problem wu run, it has started again from zero as expected. Will watch and report.

I notice the other machine that is crunching that unit has not returned it yet either, normally that machine, like mine, returns wu\'s very quickly.

The wu has now been running again for 50:45 at 100%, looks like it is going nowhere, but I'll give it another 30 minutes or so just to give it a chance.

This is an insidious fault since there is no indication that anything is wrong to a casual glance.
____________
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

Adywebb
Avatar
Send message
Joined: Jan 5 07
Posts: 15
Credit: 11,657
RAC: 0

Just to say I've not had a problem with any of these new WU's so far - all completing without problem in around 10 minutes.
____________

adrianxw
Avatar
Send message
Joined: Mar 8 06
Posts: 145
Credit: 474,763
RAC: 883

Sorry, it won't let me edit any more!

It has run for 90 minutes now in its second life. I have suspended it pending advice. Collectively, it has had over 10 hours now.

There are others having problems discussed in this thread.
____________
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

Profile Nightbird
Send message
Joined: Mar 7 06
Posts: 110
Credit: 395,345
RAC: 0

Sorry, it won't let me edit any more!

It has run for 90 minutes now in its second life. I have suspended it pending advice. Collectively, it has had over 10 hours now.

If the wu keeps running, it will probably finish with a "Maximum CPU time exceeded".
____________

Do you want to get banned for 31 years and your account & credits deleted at a Boinc project ? Predictor@home is your best choice.

wolfsong
Send message
Joined: Feb 11 07
Posts: 4
Credit: 7,048
RAC: 0

Thanks for all the help :) Was not using the screensaver at all, but have updated Boinc version. Not had a mappredict wu download yet today though so no idea if it made any difference!

Andreas
Send message
Joined: Feb 19 07
Posts: 6
Credit: 71,432
RAC: 61


Dear Win98 users, we investigated this problem and have so far not found a workaround. It's caused by Win98 behaving a bit differently when starting a new process. The good news is that the program runs correctly, as long as you don't close that window. We are getting back valid results from Win98 clients. We therefore decided to start sending out workunits again. In the meantime we will keep looking for a workaround.
Nick


When I ran the predictor app, the DOS box opened, but I also got this message over and over:

2007-02-19 15:56:21|malariacontrol.net beta|app reporting negative CPU: -737869762948.382080

Other than that nothing happens. I was afraid something was wrong, so I aborted those results. Should I have let them run?

Michael
Volunteer moderator
Project scientist
Send message
Joined: May 5 06
Posts: 79
Credit: 494
RAC: 0

Dear users,
there are a number of issues at the moment with the mappredictor application, which we would like to comment on.

- Stuck workunits: We see from the results we got back that the client version clearly has no influence, while multi-core processors have about twice the probability of workunits getting stuck. But since single core machines can also have this problem, it cannot be the only reason. We reduced the fpops limit for the workunits, so they will terminate by themselves within reasonable time if they get stuck. If they don't, please cancel them after an hour or so. They should in fact not take longer than 30 minutes.

- Server load: as some may have noticed, the launch of the new application has made our server struggle a bit on a few occasions. We think that this has to do with the big number of hosts which are connecting for the first time and have to download the full application. This creates a huge amount of network traffic on our side.

- We see that our error rate varies quite strongly with time, and conclude from this (for the time being) that there might be a connection between the transiently high server load and some of the errors. Therefore we are trying to throttle the number of workunits launched per hour to a value which the server can cope with, but we would still like more clients to download the application files. We would like to reach a stable state, because if there are problems related to the high server load, there is no way for us to really tell what is what. Currently we're almost down to zero for this night, and will slowly go up again tomorrow.
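
On the fpops limit mentioned in the first point, a back-of-the-envelope sketch may help. It assumes, as we understand BOINC, that the client stops a task once its run time exceeds roughly the workunit's floating-point-operation bound divided by the host's benchmarked speed; the exact rule depends on the client version, and the numbers and function name below are invented for illustration, not our actual settings.

```python
def max_runtime_seconds(fpops_bound, host_flops):
    """Approximate time after which a stuck task would be stopped.

    fpops_bound -- upper bound on floating-point operations for the workunit
    host_flops  -- host speed in floating-point operations per second,
                   as measured by the client's benchmark
    """
    return fpops_bound / host_flops


# Hypothetical bound of 3.6e12 operations on two hypothetical hosts.
for host_flops in (1.0e9, 2.0e9):
    minutes = max_runtime_seconds(3.6e12, host_flops) / 60
    print(f"{host_flops:.1e} flops -> stopped after about {minutes:.0f} minutes")
```

Under this assumption the cut-off is not the same wall-clock time everywhere: a host with a lower benchmark result gets proportionally longer before a stuck workunit is terminated.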

We apologise for the inconvenience, hope you understand that we are doing what we can, thank you very much for your collaboration, and hope you keep crunching for us!!

Cheers
Michael


____________
Michael

Profile The Gas Giant
Avatar
Send message
Joined: Mar 7 06
Posts: 1213
Credit: 3,503,340
RAC: 1,667

Thanks for the update Michael.

HomeGnome, yes you should have let it run for at least 30 minutes and maybe up to an hour before aborting it.

Live long and BOINC.

____________
Paul
(S@H1 8888)

u.dgl.
Send message
Joined: Mar 8 06
Posts: 26
Credit: 1,117,120
RAC: 728


- Stuck workunits: We see from the results we got back that the client version clearly has no influence, while multi-core processors have about twice the probability of workunits getting stuck. But since single core machines can also have this problem, it cannot be the only reason. We reduced the fpops limit for the workunits, so they will terminate by themselves within reasonable time if they get stuck. If they don't, please cancel them after an hour or so. They should in fact not take longer than 30 minutes.


This sounds like a carnival joke!
I had on my dual core pc since beginning of the mappredictor application one! wu, that was not stuck. A few moments ago i aborted the latest

20.02.2007 08:29:43|malariacontrol.net beta|Unrecoverable error for result mapwca0006143.txt_3 (aborted by user)

That wu was running overnight more than 13 hours.

Greetings
u.dgl.

____________

Andreas
Send message
Joined: Feb 19 07
Posts: 6
Credit: 71,432
RAC: 61

HomeGnome, yes you should have let it run for at least 30 minutes and maybe up to an hour before aborting it.


OK, I'll do that. But what does the "negative cpu" message mean? Every second the predictor app is running I get this message. Other than that, nothing at all is happening. Is this normal on win98 machines?

Franken_Power
Avatar
Send message
Joined: Jan 5 07
Posts: 11
Credit: 304,316
RAC: 0


- Stuck workunits: We see from the results we got back that the client version clearly has no influence, while multi-core processors have about twice the probability of workunits getting stuck. But since single core machines can also have this problem, it cannot be the only reason. We reduced the fpops limit for the workunits, so they will terminate by themselves within reasonable time if they get stuck. If they don't, please cancel them after an hour or so. They should in fact not take longer than 30 minutes.


This sounds like a carnival joke!
I had on my dual core pc since beginning of the mappredictor application one! wu, that was not stuck. A few moments ago i aborted the latest

20.02.2007 08:29:43|malariacontrol.net beta|Unrecoverable error for result mapwca0006143.txt_3 (aborted by user)

That wu was running overnight more than 13 hours.

Greetings
u.dgl.

If you don't like carnival jokes like this, don't do beta projects.... :-)) I had a lot of these WU's on my AMD X2 5000+ and no errors....

Michael
Volunteer moderator
Project scientist
Send message
Joined: May 5 06
Posts: 79
Credit: 494
RAC: 0


This sounds like a carnival joke!
I had on my dual core pc since beginning of the mappredictor application one! wu, that was not stuck. A few moments ago i aborted the latest



Was not meant as a joke;) If we look ACROSS the hosts, we find a double probability for multi-processor machines. On a given single host it may well be quite the same every time..

If you experience errors repeatedly, please try to reset the project. This causes the application files to be downloaded again. We would very much appreciate feedback from people who repeatedly had errors and did a reset (regardless of how many cpu's the host has). Did it change anything?

thanks
Michael


____________
Michael

KAMasud
Send message
Joined: Jan 7 07
Posts: 12
Credit: 18,733
RAC: 0


:-) Report time:-) my Prescott 3.06 is not handling them:-) my Celeron 1.7 is not handling them as a matter of fact it has a WU stuck at the moment:-( will try it on my P4 2.0 and my P3 1.5 and also will reset the project on my Prescott and then let you chaps know:-)
Regards
Masud.
____________

adrianxw
Joined: Mar 8 06
Posts: 145
Credit: 474,763
RAC: 883

We limited the fpops limit for the workunits, so they will terminate by themselves within a reasonable time if they get stuck.

Can I ask what constitutes a reasonable time?

The reason I ask is that the WU I described above ran for over 8 hours in a single instantiation without terminating. It was not time slicing either; BOINC was stuck crunching the MCDN WU, and no other projects were seeing any CPU.

I have a couple of machines at a remote site that I don't visit every day. If one or, worse, both of these get a stuck unit, it might be a few days before I can visit the site to manually abort them.
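
As a rough illustration of how an fpops limit could translate into a time limit, here is a minimal sketch. It assumes the BOINC client aborts a task once the CPU time it has recorded exceeds the workunit's `rsc_fpops_bound` divided by the host's benchmarked floating-point speed. The function name and the numbers are hypothetical and are not the project's actual settings.

```python
def max_cpu_seconds(rsc_fpops_bound: float, host_fpops: float) -> float:
    """Upper bound on recorded CPU time before the task would be aborted.

    Assumption: the client compares recorded CPU time against
    rsc_fpops_bound / host_fpops.
    """
    if host_fpops <= 0:
        raise ValueError("host_fpops must be positive")
    return rsc_fpops_bound / host_fpops


# Hypothetical values: a bound of 3.6e12 floating-point operations on a
# host benchmarked at 1e9 floating-point operations per second.
limit = max_cpu_seconds(3.6e12, 1.0e9)
print(f"abort after about {limit / 60:.0f} minutes of CPU time")
```

If the limit really is checked against recorded CPU time, a stuck task whose CPU time is not being reported to the client might never reach it, however long it sits there.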
____________
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

AnRM
Joined: Mar 7 06
Posts: 54
Credit: 2,130,571
RAC: 0


:-) Report time :-) My Prescott 3.06 is not handling them :-) My Celeron 1.7 is not handling them either; as a matter of fact it has a WU stuck at the moment :-( I will try it on my P4 2.0 and my P3 1.5, and I will also reset the project on my Prescott and then let you chaps know :-)
Regards
Masud.

Masud, this must be very frustrating for you... We process about 400-500 MC WUs/day and have yet to see a stuck WU. We are running everything from Intel Celerons and AMD Durons to AMD64 X2 dual cores. Now these are 24/7, WinXP, 'blank' screensaver, MC-dedicated machines, so they have few processing interruptions. I mention this only because I was wondering whether excessive checkpointing, process interruptions or not leaving suspended WUs 'in memory' could be causing this. I seem to recall that when Rosetta@Home started they had a similar problem. Hope this helps... Rog.
____________

adrianxw
Joined: Mar 8 06
Posts: 145
Credit: 474,763
RAC: 883

This is also happening on 24/7, leave-in-memory, no-graphics, BOINC-crunching-only machines. That is exactly the kind of setup I had which stuck. What is more, the "stick" was repeatable: after 8.5 hours I stopped and started BOINC, as that usually frees stuck WUs, then suspended all other projects and let the same WU run again. It stuck again.
____________
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

Nightbird
Joined: Mar 7 06
Posts: 110
Credit: 395,345
RAC: 0


- Stuck workunits: We see from the results we got back that the client version clearly has no influence, while multi-core processors have about twice the probability of workunits getting stuck. But since single-core machines can also have this problem, it cannot be the only reason. We limited the fpops limit for the workunits, so they will terminate by themselves within a reasonable time if they get stuck. If they don't, please cancel them after an hour or so; they should in fact not take longer than 30 minutes.


This sounds like a carnival joke!
Since the mappredictor application started, I have had exactly one WU on my dual-core PC that did not get stuck. A few moments ago I aborted the latest:

20.02.2007 08:29:43|malariacontrol.net beta|Unrecoverable error for result mapwca0006143.txt_3 (aborted by user)

That WU had been running overnight for more than 13 hours.

Greetings
u.dgl.

If you don't like carnival jokes like this, don't do beta projects... :-)) I have had a lot of these WUs on my AMD X2 5000+ and no errors...

Same here on my AMD X2s, running fine:
Athlon64 X2 4400+ - XP home - CC 5.6.4
Athlon64 X2 4600+ - Win2k Sp4 - CC 5.4.9

____________

Do you want to get banned for 31 years and your account & credits deleted at a Boinc project ? Predictor@home is your best choice.

Nightbird
Joined: Mar 7 06
Posts: 110
Credit: 395,345
RAC: 0

HomeGnome, yes, you should have let it run for at least 30 minutes and maybe up to an hour before aborting it.


OK, I'll do that. But what does the "negative cpu" message mean? Every second the predictor app is running I get this message. Other than that, nothing at all is happening. Is this normal on Win98 machines?

Same on a Millennium machine
https://malariacontrol.net/forum_thread.php?id=378

____________

Do you want to get banned for 31 years and your account & credits deleted at a Boinc project ? Predictor@home is your best choice.

wolfsong
Joined: Feb 11 07
Posts: 4
Credit: 7,048
RAC: 0

I reset the project; it made no difference. Both mappredict WUs on the machine at the time failed to run and I aborted them. Normal units are still working perfectly. No more mappredict units have downloaded to my machine since.

Michael
Volunteer moderator
Project scientist
Joined: May 5 06
Posts: 79
Credit: 494
RAC: 0


We have currently stopped sending out new workunits of the mappredictor application while we are trying to sort out this "stuck" problem. Our problem is that this error is a bit difficult to reproduce here, because we never get it :)
But maybe if we keep asking the right questions, you might be able to help us:

The new application consists of two processes, mappredictor_1.12... (the boinc wrapper application), and predictor_1.12.. (the science application). If your workunit gets stuck and you have a look at the task manager, are both processes still active, or only one? Which one? (Make sure only one workunit of this application is running; otherwise you might see something that belongs to another workunit.)

Thanks for reporting; so far your comments have been very helpful to us.
cheers
Michael
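
For anyone who would rather check this from a list of process names than by eye, here is a minimal sketch. It only assumes the two name prefixes mentioned above (`mappredictor_` for the wrapper, `predictor_` for the science application); the `classify` helper and the sample listing are hypothetical and made up for illustration.

```python
WRAPPER_PREFIX = "mappredictor_"   # BOINC wrapper process
SCIENCE_PREFIX = "predictor_"      # science application process


def classify(process_names):
    """Report which of the two processes appear in a list of process names."""
    names = [name.lower() for name in process_names]
    return {
        "wrapper": any(n.startswith(WRAPPER_PREFIX) for n in names),
        "science": any(n.startswith(SCIENCE_PREFIX) for n in names),
    }


# Hypothetical task-manager listing, for illustration only.
sample = ["explorer.exe", "boinc.exe", "predictor_example.exe"]
print(classify(sample))
```

Matching on the start of the name keeps the two apart, since a name beginning with `mappredictor_` does not begin with `predictor_`.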


____________
Michael

FreeLarry
Joined: Jun 21 06
Posts: 5
Credit: 3,126,336
RAC: 0


We have currently stopped sending out new workunits of the mappredictor application while we are trying to sort out this "stuck" problem. Our problem is that this error is a bit difficult to reproduce here, because we never get it :)
But maybe if we keep asking the right questions, you might be able to help us:

The new application consists of two processes, mappredictor_1.12... (the boinc wrapper application), and predictor_1.12.. (the science application). If your workunit gets stuck and you have a look at the task manager, are both processes still active, or only one? Which one? (Make sure only one workunit of this application is running; otherwise you might see something that belongs to another workunit.)

Thanks for reporting; so far your comments have been very helpful to us.
cheers
Michael



mappredictor_1.12 - never saw this one running at any time in the task manager - it was there, just not using CPU
predictor_1.12 - the only process I ever saw running in the task manager - it would stay even after the unit had supposedly finished and been reported

Larry
____________

FalconFly
Joined: Mar 7 06
Posts: 92
Credit: 5,517,713
RAC: 0

Not sure if it helps, but from my perspective, some Systems seem much more prone to the "Stuck WU" Problem than others.

On 24 Systems (22 Linux, 2 Win2000), I've seen only 3 that ever had this Problem repeatedly.
These were Linux Boxes (two Dual Core, one Single Core) running an optimized 5.2.13 BOINC until a few days ago (I have since switched all of them to the official 5.8.11 release).

One System had the Problem far more often than the two others (up to twice a day at peak times):
Host 1598

That still leaves me with no clue as to exactly why this was the one System most prone to the Problem. It never had any Troubles with other Projects and did not show any abnormalities.

It's a native 64bit Fedora Core 4 System, Terminal only (no GUI), with some unneeded Services disabled and at Standard BIOS/Performance settings (no Tweaks or Overclock), set up the same way I have all Linux Systems running.

So far, after switching to 5.8.11, I have not witnessed any more Stuck WorkUnits, but it's probably too early to tell if that changed anything (it has only been 4.5 days, a cumulative 2600 hours of V5.8.11-based CPU time so far).

Only if I see the new BOINC Version running without any Stuck WorkUnits for, let's say, a month (some 17000 hours total CPU time) would I go as far as to suspect that the new BOINC Version (somehow) fixed that elusive error.
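
The two CPU-hour figures in this post can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted above and assumes a 30-day month; the variable names are made up for illustration.

```python
hours_so_far = 2600.0   # cumulative CPU hours reported in the post
days_so_far = 4.5       # wall-clock days since the switch to 5.8.11

# Average number of CPUs crunching in parallel over that period.
cpus = hours_so_far / (days_so_far * 24)

# Projected cumulative CPU hours over an assumed 30-day month.
month_hours = cpus * 24 * 30

print(f"about {cpus:.1f} CPUs, about {month_hours:.0f} CPU hours per month")
```

That works out to roughly 24 CPUs running around the clock and a little over 17000 CPU hours for the month, which matches the estimate given.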
____________
Scientific Network : 44800 MHz - 77824 MB - 1970 GB

FalconFly
Joined: Mar 7 06
Posts: 92
Credit: 5,517,713
RAC: 0

Seems my hopes came too soon.

My Host 1598 caught a stuck one (WU_24_34_28669_0_497229148_2) again.

The only odd thing is that it got stuck after 24m 10s instead of stalling right at the beginning (where I saw the most stuck ones so far).
The task ran at 100% CPU load for over 100 Minutes, but basically made no progress. Restarting BOINC helped as usual, and the WorkUnit will now likely complete normally.




____________
Scientific Network : 44800 MHz - 77824 MB - 1970 GB

Michael
Volunteer moderator
Project scientist
Joined: May 5 06
Posts: 79
Credit: 494
RAC: 0

Dear users,
we have fixed the bug which was causing the workunits to get stuck. We are planning to release a new application version today and will start sending out new workunits.
____________
Michael

Franken_Power
Joined: Jan 5 07
Posts: 11
Credit: 304,316
RAC: 0


We have currently stopped sending out new workunits of the mappredictor application while we are trying to sort out this \"stuck\" problem.


Just received a new bunch of mapwc cands and all of them ended in an error...

adrianxw
Joined: Mar 8 06
Posts: 145
Credit: 474,763
RAC: 883

we have fixed the bug which was causing the workunits to get stuck.

Sorry, but no you haven't!

This WU did the same as the other one I reported (this thread, 18/2). It got stuck, and when it stuck, it would not release the CPU until it finally crashed. My machine (not the same one as last time; it was this one) has claimed 0.08 seconds of CPU time, but if you look at the message log for this machine, you can see that it actually grabbed the CPU at 17:52 yesterday and held it until it crashed (or at least some event happened, see below) at 9:01 the next morning.

<core_client_version>5.8.11</core_client_version>
<![CDATA[
<message>
- exit code 1282 (0x502)
</message>
<stderr_txt>
o1
c1
app error: 0x502

entering rename_outfile
copying file to: ../../projects/malariacontrol.net/mapwca0044865.txt_1_0
error copying..2
copying file to: ../../projects/malariacontrol.net/mapwca0044865.txt_1_1
error copying..2
copying file to: ../../projects/malariacontrol.net/mapwca0044865.txt_1_2
error copying..2

</stderr_txt>
]]>


The files it mentions in the above trace are not present in the target directory.

As you can see below, no other tasks ran in the interim; the only activity was Proteins@Home trying, and failing, to get in touch with its home, and MCDN reporting. All this time, SIMAP, Docking@Home and Rosetta were sitting waiting. At 22:36, Proteins tried again, got a "wait for 31 seconds", then nothing. There are no further log entries until MCDN aborts. 9:01 is "about" the time I arrived on site here. I saw the "task crashed send error report" type messagebox and pressed OK, so it may have been me pressing OK that started things up again. If I had not been to the site today, as I frequently am not, it may have sat like that for days.

This event demonstrates that BOINC is not dead, since the download/upload scheduling keeps functioning for a while; it is, however, not swapping tasks. It is possible that MCDN actually crashed at 22:36 and that it posted the messagebox, causing BOINC to wait. I don't know, I wasn't here.

I said before, I have machines at a site that I don't visit every day. This event occurred there. If you continue to send these WUs, I will have no choice but to suspend MCDN at that site.

06/03/2007 17:52:13|proteins@home|Computation for task b.36.1.2.0A-63-46_0 finished
06/03/2007 17:52:13|malariacontrol.net beta|Starting mapwca0044865.txt_1
06/03/2007 17:52:14|malariacontrol.net beta|Starting task mapwca0044865.txt_1 using mappredictor version 114
06/03/2007 17:52:16|proteins@home|[file_xfer] Started upload of file b.36.1.2.0A-63-46_0_0.zip
06/03/2007 17:52:33|proteins@home|[file_xfer] Finished upload of file b.36.1.2.0A-63-46_0_0.zip
06/03/2007 17:52:33|proteins@home|[file_xfer] Throughput 53197 bytes/sec
06/03/2007 19:50:32|malariacontrol.net beta|Sending scheduler request: To report completed tasks
06/03/2007 19:50:32|malariacontrol.net beta|Reporting 1 tasks
06/03/2007 19:50:37|malariacontrol.net beta|Scheduler RPC succeeded [server version 507]
06/03/2007 19:50:37|malariacontrol.net beta|Deferring communication for 11 sec
06/03/2007 19:50:37|malariacontrol.net beta|Reason: requested by project
06/03/2007 20:16:36|proteins@home|Sending scheduler request: To report completed tasks
06/03/2007 20:16:36|proteins@home|Reporting 1 tasks
06/03/2007 20:16:58||Project communication failed: attempting access to reference site
06/03/2007 20:17:00||Access to reference site succeeded - project servers may be temporarily down.
06/03/2007 20:17:02|proteins@home|Scheduler request failed: couldn't connect to server
06/03/2007 20:17:02|proteins@home|Deferring communication for 1 min 0 sec
06/03/2007 20:17:02|proteins@home|Reason: scheduler request failed
06/03/2007 20:18:02|proteins@home|Sending scheduler request: To report completed tasks
06/03/2007 20:18:02|proteins@home|Reporting 1 tasks
06/03/2007 20:18:24||Project communication failed: attempting access to reference site
06/03/2007 20:18:25||Access to reference site succeeded - project servers may be temporarily down.
06/03/2007 20:18:27|proteins@home|Scheduler request failed: couldn't connect to server
06/03/2007 20:18:27|proteins@home|Deferring communication for 1 min 0 sec
06/03/2007 20:18:27|proteins@home|Reason: scheduler request failed
06/03/2007 20:19:28|proteins@home|Sending scheduler request: To report completed tasks
06/03/2007 20:19:28|proteins@home|Reporting 1 tasks
06/03/2007 20:19:50||Project communication failed: attempting access to reference site
06/03/2007 20:19:51||Access to reference site succeeded - project servers may be temporarily down.
06/03/2007 20:19:53|proteins@home|Scheduler request failed: couldn't connect to server
06/03/2007 20:19:53|proteins@home|Deferring communication for 1 min 0 sec
06/03/2007 20:19:53|proteins@home|Reason: scheduler request failed
06/03/2007 20:20:53|proteins@home|Sending scheduler request: To report completed tasks
06/03/2007 20:20:53|proteins@home|Reporting 1 tasks
06/03/2007 20:21:14||Project communication failed: attempting access to reference site
06/03/2007 20:21:17||Access to reference site succeeded - project servers may be temporarily down.
06/03/2007 20:21:19|proteins@home|Scheduler request failed: couldn't connect to server
06/03/2007 20:21:19|proteins@home|Deferring communication for 1 min 0 sec
06/03/2007 20:21:19|proteins@home|Reason: scheduler request failed
06/03/2007 20:22:19|proteins@home|Sending scheduler request: To report completed tasks
06/03/2007 20:22:19|proteins@home|Reporting 1 tasks
06/03/2007 20:22:41||Project communication failed: attempting access to reference site
06/03/2007 20:22:42||Access to reference site succeeded - project servers may be temporarily down.
06/03/2007 20:22:44|proteins@home|Scheduler request failed: couldn't connect to server
06/03/2007 20:22:44|proteins@home|Deferring communication for 1 min 11 sec
06/03/2007 20:22:44|proteins@home|Reason: scheduler request failed
06/03/2007 20:24:00|proteins@home|Sending scheduler request: To report completed tasks
06/03/2007 20:24:00|proteins@home|Reporting 1 tasks
06/03/2007 20:24:22||Project communication failed: attempting access to reference site
06/03/2007 20:24:23||Access to reference site succeeded - project servers may be temporarily down.
06/03/2007 20:24:25|proteins@home|Scheduler request failed: couldn't connect to server
06/03/2007 20:24:25|proteins@home|Deferring communication for 4 min 9 sec
06/03/2007 20:24:25|proteins@home|Reason: scheduler request failed
06/03/2007 20:28:37|proteins@home|Sending scheduler request: To report completed tasks
06/03/2007 20:28:37|proteins@home|Reporting 1 tasks
06/03/2007 20:28:59||Project communication failed: attempting access to reference site
06/03/2007 20:29:02||Access to reference site succeeded - project servers may be temporarily down.
06/03/2007 20:29:02|proteins@home|Scheduler request failed: couldn't connect to server
06/03/2007 20:29:02|proteins@home|Deferring communication for 11 min 39 sec
06/03/2007 20:29:02|proteins@home|Reason: scheduler request failed
06/03/2007 20:40:46|proteins@home|Sending scheduler request: To report completed tasks
06/03/2007 20:40:46|proteins@home|Reporting 1 tasks
06/03/2007 20:41:08||Project communication failed: attempting access to reference site
06/03/2007 20:41:14|proteins@home|Scheduler request failed: couldn't connect to server
06/03/2007 20:41:14|proteins@home|Deferring communication for 14 min 21 sec
06/03/2007 20:41:14|proteins@home|Reason: scheduler request failed
06/03/2007 20:41:15||Access to reference site succeeded - project servers may be temporarily down.
06/03/2007 20:55:40|proteins@home|Sending scheduler request: To report completed tasks
06/03/2007 20:55:40|proteins@home|Reporting 1 tasks
06/03/2007 20:56:02||Project communication failed: attempting access to reference site
06/03/2007 20:56:04||Access to reference site succeeded - project servers may be temporarily down.
06/03/2007 20:56:06|proteins@home|Scheduler request failed: couldn't connect to server
06/03/2007 20:56:06|proteins@home|Deferring communication for 1 hr 39 min 57 sec
06/03/2007 20:56:06|proteins@home|Reason: scheduler request failed
06/03/2007 22:36:07|proteins@home|Sending scheduler request: To report completed tasks
06/03/2007 22:36:07|proteins@home|Reporting 1 tasks
06/03/2007 22:36:13|proteins@home|Scheduler RPC succeeded [server version 509]
06/03/2007 22:36:13|proteins@home|Deferring communication for 31 sec
06/03/2007 22:36:13|proteins@home|Reason: requested by project

07/03/2007 09:01:02|malariacontrol.net beta|Deferring communication for 1 min 0 sec
07/03/2007 09:01:02|malariacontrol.net beta|Reason: Unrecoverable error for result mapwca0044865.txt_1 ( - exit code 1282 (0x502))
07/03/2007 09:01:02|malariacontrol.net beta|Computation for task mapwca0044865.txt_1 finished
07/03/2007 09:01:02|malariacontrol.net beta|Output file mapwca0044865.txt_1_0 for task mapwca0044865.txt_1 absent
07/03/2007 09:01:02|malariacontrol.net beta|Output file mapwca0044865.txt_1_1 for task mapwca0044865.txt_1 absent
07/03/2007 09:01:02|malariacontrol.net beta|Output file mapwca0044865.txt_1_2 for task mapwca0044865.txt_1 absent
07/03/2007 09:01:02|boincsimap|Restarting task 70303001.015376_0 using hmmer version 509
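
To put a number on how long the task held the CPU, the log lines above can be parsed directly. This is a minimal sketch that assumes the `DD/MM/YYYY HH:MM:SS|project|message` layout seen in the log; the helper names are hypothetical, and only the two relevant lines are reproduced here.

```python
from datetime import datetime


def parse_line(line):
    """Split 'DD/MM/YYYY HH:MM:SS|project|message' into its three parts."""
    stamp, project, message = line.split("|", 2)
    return datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S"), project, message


def task_duration(lines, task):
    """Wall-clock time between a task's 'Starting' and 'finished' messages."""
    start = end = None
    for line in lines:
        if not line.strip():
            continue  # skip blank lines in the pasted log
        when, _, message = parse_line(line)
        if message == f"Starting {task}":
            start = when
        elif message == f"Computation for task {task} finished":
            end = when
    if start is None or end is None:
        return None
    return end - start


log = [
    "06/03/2007 17:52:13|malariacontrol.net beta|Starting mapwca0044865.txt_1",
    "",
    "07/03/2007 09:01:02|malariacontrol.net beta|Computation for task mapwca0044865.txt_1 finished",
]
print(task_duration(log, "mapwca0044865.txt_1"))
```

For the two lines shown this gives 15:08:49, i.e. a little over 15 hours of wall-clock time against the 0.08 seconds of CPU time claimed.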

____________
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

maire
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: Nov 7 05
Posts: 438
Credit: 118,258
RAC: 0

Except for a few workunits that are being resent, there should be no more map predictor jobs. This app in its current version caused more trouble than we expected. We are currently working on a new version.

We won't start sending new workunits for this application without prior warning. After successful in-house testing, we will first start distributing them on an opt-in basis to those users who agree to receive beta application work (beta in beta in our case...). Those workunits that have been sent back already have been used to create a preliminary map.
Nick


____________
Nicolas Maire
Swiss Tropical and Public Health Institute
http://www.swisstph.ch

adrianxw
Joined: Mar 8 06
Posts: 145
Credit: 474,763
RAC: 883

Cheers Nick. I'm quite happy to run it here, but at my remote site it is risky. I'm sure you understand.
____________
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

j2satx
Joined: Jan 4 07
Posts: 12
Credit: 2,063,480
RAC: 745

Except for a few workunits that are being resent, there should be no more map predictor jobs. This app in its current version caused more trouble than we expected. We are currently working on a new version.

We won't start sending new workunits for this application without prior warning. After successful in-house testing, we will first start distributing them on an opt-in basis to those users who agree to receive beta application work (beta in beta in our case...). Those workunits that have been sent back already have been used to create a preliminary map.
Nick



How does one "opt-in" to receive test WUs?

Ken Vogt
Joined: Mar 7 06
Posts: 3
Credit: 65,618
RAC: 0

How does one "opt-in" to receive test WUs?

Hi j2satx, see this thread.
____________
Ken



Copyright © 2013 africa@home