Optimizer growing up


Michael
Volunteer moderator
Project scientist
Joined: May 5 06
Posts: 79
Credit: 494
RAC: 0
Message 7832 - Posted 12 Sep 2008 9:40:56 UTC

    The "optimizer" application will leave testing status again sometime next week, starting from Monday, 15 September.

    Read below an old (but updated) post on what this application does and how you can opt out of running it. By default, you will get workunits of this application, unless it is in testing state and you did not volunteer to run testing workunits.

    Only Windows hosts will get work.
    ____________
    Michael

    Michael
    Volunteer moderator
    Project scientist
    Joined: May 5 06
    Posts: 79
    Credit: 494
    RAC: 0
    Message 7833 - Posted 12 Sep 2008 9:43:16 UTC - in response to Message 7832.

      This post was last updated on 12 Sept. 08 and is maintained only for general information about the optimizer science application.

      Watch this thread for news; the application will leave testing status during next week (from 15 September).

      At first, it will be run as a test application, meaning that only users who have "run test applications" and "run optimizer application" checked in their account settings (under malariacontrol.net preferences) will get work.

      In addition, only Windows hosts will get work.


      Work units will take 1 to 2 hours, depending on the model parameters. Checkpointing is now done, and progress is indicated, though not very reliably (so don't worry if it says 100% and then continues for a long time; the maximum is 2 hours for now).

      Calculation is done by a Java program, contained within the standard BOINC "wrapper" application. You don't need to have Java installed: a Java runtime environment is included in the application.
      Deadlines: Three days.

      The name "optimizer" for this application was chosen because the server-side components are essentially a "general use" optimization framework to be used by scientists in our group to work on more specific questions, e.g. to fit simpler models for which the "big" malaria model would just not be what you want. The insights from those calculations will help us to improve the main malariacontrol application in the future.


      On the science of the project:


      To make quantitative predictions of malaria transmission, it is very important to know how long an infection lasts in an infected human: the longer it lasts, the more mosquitoes can get infected; the more infected mosquitoes there are, the more humans get infected; and so on.
      It may at first seem very straightforward to measure this: you note when somebody gets infected, and then you keep taking blood samples until that person is no longer infected.
      Unfortunately, you only have a chance of about 50% to detect an infection, given that it is there. So you already have a problem: you don't know when the infection started, and you don't know exactly when it ended.
      In addition, in areas of high malaria transmission, people are very often carrying up to ten or more infections simultaneously, so you never know whether what you're seeing is still the same infection or a new one.
      Recently, some work at our institute has used new DNA-based methods (which allow different infections to be distinguished), together with a mathematical approach, to estimate the average duration of an untreated P. falciparum infection.


      See Sama et al. 2006
      (sorry, only the abstract is freely available to the public)


      So far so good; this was an important step forward. The problem that remains is: how are the durations distributed? In other words: do all of the infections last exactly 200 days and then all stop? Or does an infection have a probability of disappearing that remains constant no matter how old the infection is? Probably neither is true, but we need to describe the shape of that distribution of durations somehow in order to make sensible predictions.
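
The two extremes described above can be made concrete with a minimal Python sketch. This is not the project's model: the 200-day mean is simply taken from the example in the post, and the function names are made up for illustration. Both models share the same mean duration but give very different answers to "what fraction of infections is still present after t days?".

```python
import math

MEAN_DURATION_DAYS = 200.0  # illustrative value, taken from the example in the post


def surviving_fixed(t_days, mean=MEAN_DURATION_DAYS):
    """Fraction of infections still present if every infection lasts exactly `mean` days."""
    return 1.0 if t_days < mean else 0.0


def surviving_constant_hazard(t_days, mean=MEAN_DURATION_DAYS):
    """Fraction still present if clearance happens at a constant rate 1/mean
    (exponentially distributed durations)."""
    return math.exp(-t_days / mean)


if __name__ == "__main__":
    for day in (50, 100, 200, 400):
        print(day, surviving_fixed(day), round(surviving_constant_hazard(day), 3))
```

A realistic distribution presumably lies somewhere between these two shapes, which is why its shape has to be estimated from data.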


      For more on that, see Sama et al. 2006b


      That's almost where we want to go, except for one thing: the above paper measures the distribution in people living in the US who had never experienced malaria before. They were infected on purpose, to cure their syphilis (the method of choice at that time). We don't know what the picture looks like in people living in areas of high transmission, with multiple infections at a time and after decades of being constantly infected.

      Attempts to find a mathematical solution to this problem did not work out: the equations become unsolvable. But there is a way out: instead of using equations, we can use individual-based simulations. That means we simulate every single infection in a computer program and see which parameters best reproduce the data we have. The big drawback is that this just takes too long to calculate on a single computer.
      That's what we need you guys and girls for, and thanks a lot for making this possible!
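
As a rough illustration of the "simulate every infection and see which parameters fit" idea, here is a toy Python sketch. It is not the optimizer's code. The exponential duration model, the survey days, one infection per person, the 50% detection probability per sample, and the simple grid search are all assumptions made for this example.

```python
import random


def simulate_positive_fraction(mean_duration, n_people=500,
                               survey_days=(0, 30, 60, 90),
                               detect_prob=0.5, seed=1):
    """Simulate one infection per person starting at day 0 and return the
    fraction of (person, survey) blood samples that test positive."""
    rng = random.Random(seed)
    positives = 0
    for _ in range(n_people):
        # Assumed duration model: exponential with the given mean.
        duration = rng.expovariate(1.0 / mean_duration)
        for day in survey_days:
            infected = day < duration
            # An existing infection is only detected with probability detect_prob.
            if infected and rng.random() < detect_prob:
                positives += 1
    return positives / (n_people * len(survey_days))


def fit_mean_duration(observed_fraction, candidates):
    """Return the candidate mean duration whose simulated output is closest to the data."""
    return min(candidates,
               key=lambda m: abs(simulate_positive_fraction(m) - observed_fraction))


if __name__ == "__main__":
    # Pretend the "field data" were produced by a true mean of 200 days.
    observed = simulate_positive_fraction(200)
    print(fit_mean_duration(observed, [50, 100, 200, 400]))
```

Each candidate parameter set needs its own full simulation, which is why the real problem is spread over many volunteer computers.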

      P.S.: Something about the data collection mentioned above, to prevent misunderstandings: there are strict ethical guidelines on how one is allowed to obtain such data. Since most malaria infections in high-transmission areas don't cause any symptoms, being infected with malaria doesn't mean you are sick (because of acquired immunity). People who did have symptoms were of course given treatment.
      ____________
      Michael

      Michael
      Volunteer moderator
      Project scientist
      Joined: May 5 06
      Posts: 79
      Credit: 494
      RAC: 0
      Message 7863 - Posted 15 Sep 2008 7:26:54 UTC - in response to Message 7833.

        Thanks to all those who helped with testing! After running the new version in testing status over the weekend, everything seems to be fine, with no major issues, so we are leaving testing status now! It's always a great moment to "unleash" the full power of a BOINC project and see the results coming back! Why am I so motivated at 9:20 am on a Monday morning? I think I have an interesting job, not always, but right now this is very exciting... :)
        We are confident we will have a batch of sensible results from this application by the end of November at the latest. They will then be written up and published, preferably in an open-access journal, so we can post a link here.

        cheers


        ____________
        Michael

        John Clark
        Joined: Feb 10 08
        Posts: 2314
        Credit: 1,412,023
        RAC: 3,767
        Message 7866 - Posted 15 Sep 2008 15:16:32 UTC

          Last modified: 15 Sep 2008 15:20:58 UTC

          Michael

          Where can I look up data/information on this optimiser client, as I run Win XP on both the rigs I use for Malaria?

          Maybe I am a little thick, but I assume "optimiser" refers to a Malaria project client which uses the specialist Intel instruction sets (like MMX, SSE, SSE2, SSE3, SSSE3x or SSE4.1)?

          I would love to volunteer to test if this is the correct view, or test when I understand what I can do to contribute, as well as using the stock client.

          I did not look around too deeply before I posted this, so I have some reading to carry out.

          Hopefully my questions will be answered during this reading, but if there are any short answers you might post, I would appreciate the understanding (both for me and from you).
          ____________
          Go away, I was asleep

          Said a Russell, 3 Shih-Tzus & a Bischeon Frize

          Profile Krunchin-Keith [USA]
          Volunteer moderator
          Volunteer tester
          Joined: Nov 10 05
          Posts: 3319
          Credit: 5,590,665
          RAC: 460
          Message 7871 - Posted 15 Sep 2008 22:35:56 UTC - in response to Message 7866.

            Last modified: 15 Sep 2008 22:36:31 UTC

            Michael

            Where can I look up data/information on this optimiser client, as I run Win XP on both the rigs I use for Malaria?

            See my FAQ thread Running the different applications in number crunching.
            See the 2nd post in this thread, it explains it too.

            Maybe I am a little thick, but I assume "optimiser" refers to a Malaria project client which uses the specialist Intel instruction sets (like MMX, SSE, SSE2, SSE3, SSSE3x or SSE4.1)?

            No, it is not an application optimized for Intel.
            See the second post in this thread for a description. It explains what optimizer means.


            I would love to volunteer to test if this is the correct view, or test when I understand what I can do to contribute, as well as using the stock client.

            Set your settings, the more the merrier.

            I did not look around too deeply before I posted this, so I have some reading to carry out.

            Hopefully my questions will be answered during this reading, but if there are any short answers you might post, I would appreciate the understanding (both for me and from you).

            John Clark
            Joined: Feb 10 08
            Posts: 2314
            Credit: 1,412,023
            RAC: 3,767
            Message 7872 - Posted 15 Sep 2008 23:15:18 UTC

              Done as per the highlighted bits in your first post.

              Now we will see what happens.
              ____________
              Go away, I was asleep

              Said a Russell, 3 Shih-Tzus & a Bischeon Frize

              Michael
              Volunteer moderator
              Project scientist
              Joined: May 5 06
              Posts: 79
              Credit: 494
              RAC: 0
              Message 7896 - Posted 17 Sep 2008 12:49:57 UTC - in response to Message 7872.

                Good! Thanks, Keith, for helping.
                ____________
                Michael

                d_a_dempsey
                Joined: Feb 29 08
                Posts: 3
                Credit: 2,662,581
                RAC: 3,481
                Message 7898 - Posted 17 Sep 2008 14:41:17 UTC

                  I have received a couple of these "optimiser packets." I assume these are the work units of the name "Estimation of parameters...". Two of these have reached 100.000% completion, have no more remaining completion time--and continue to run. They will go to a "waiting to run" status, back to "running" but have not completed, even after being 100% complete for almost 24 hours.

                  Is this normal?

                  Michael
                  Volunteer moderator
                  Project scientist
                  Joined: May 5 06
                  Posts: 79
                  Credit: 494
                  RAC: 0
                  Message 7899 - Posted 17 Sep 2008 15:05:42 UTC - in response to Message 7898.

                    ..that does not sound good..

                    That it shows 100% and then continues is not bad by itself; that percentage is just approximate. But not for 24 hours!
                    The maximum is 2 hours (for the batch that went out now).
                    Please abort them by hand, and if possible, could you post a link to the workunits? That would help to find the problem. This really shouldn't happen.

                    ____________
                    Michael

                    Augustine
                    Joined: Mar 7 06
                    Posts: 36
                    Credit: 275,979
                    RAC: 25
                    Message 7902 - Posted 17 Sep 2008 15:50:57 UTC - in response to Message 7899.

                      Last modified: 17 Sep 2008 15:54:06 UTC

                      I noticed the same issue and these WUs took longer than 2h to complete sitting at 100% for most of the time, for too long even with the CPU throttled at 17%:


                      HTH
                      ____________

                      Michael
                      Volunteer moderator
                      Project scientist
                      Joined: May 5 06
                      Posts: 79
                      Credit: 494
                      RAC: 0
                      Message 7911 - Posted 17 Sep 2008 20:42:48 UTC - in response to Message 7902.

                        I noticed the same issue and these WUs took longer than 2h to complete sitting at 100% for most of the time, for too long even with the CPU throttled at 17%:


                        HTH



                        Augustine, those look correct. They took a long time because the CPU was throttled. For the first one it appears that the CPU was throttled to about 5% (maybe you had more than one WU running at the same time?), and for the second one to about 15%, pretty close to the 17% you mentioned. If you compare "cpu time" on the pages above with the actual time it took, and take the throttling into account, it seems OK to me.
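
The comparison of CPU time with elapsed time can be written down as simple arithmetic. The following Python sketch is only illustrative; the function names are made up here, and it assumes the task does nothing but wait while it is throttled.

```python
def effective_cpu_share(cpu_seconds, wall_seconds):
    """Fraction of elapsed (wall-clock) time the task actually spent on the CPU."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return cpu_seconds / wall_seconds


def expected_wall_hours(cpu_hours, throttle_fraction):
    """Approximate wall-clock hours for a task needing `cpu_hours` at a given throttle."""
    if not 0 < throttle_fraction <= 1:
        raise ValueError("throttle_fraction must be in (0, 1]")
    return cpu_hours / throttle_fraction


if __name__ == "__main__":
    print(round(expected_wall_hours(2.0, 0.17), 1))
```

For example, a task needing 2 hours of CPU time at a 17% throttle would take roughly 2 / 0.17, or about 11.8 hours, of wall-clock time, so sitting at 100% for many hours is consistent with heavy throttling.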

                        ____________
                        Michael

                        Augustine
                        Joined: Mar 7 06
                        Posts: 36
                        Credit: 275,979
                        RAC: 25
                        Message 7912 - Posted 17 Sep 2008 21:03:58 UTC - in response to Message 7911.

                          if you compare "cpu time" on the pages above, with the actual time it took, and take into account the throttling, it seems ok to me..

                          OK. I did have one system error out most WUs though...

                          TIA

                          ____________

                          Michael
                          Volunteer moderator
                          Project scientist
                          Joined: May 5 06
                          Posts: 79
                          Credit: 494
                          RAC: 0
                          Message 7918 - Posted 18 Sep 2008 13:28:56 UTC - in response to Message 7912.

                            Last modified: 18 Sep 2008 13:32:37 UTC

                            if you compare "cpu time" on the pages above, with the actual time it took, and take into account the throttling, it seems ok to me..

                            OK. I did have one system error out most WUs though...

                            TIA


                            Saw it. Very strange, since the contained science app actually terminated correctly (looking at the std error) and also gave the correct result. It only affects that one computer of yours. I also noticed that your client seems not to report back the app version, so something must be fishy there... :)

                            Things to try:
                            - Reset project malariacontrol (so the applications are downloaded again).
                            - Try installing BOINC in a different place (not "allusers/application_data..."). I am not sure about this, but it's an unusual place to have BOINC installed.
                            - Otherwise I would recommend opting out of running optimizer WUs (check the "no" box for "run optimizer app" in your account -> project settings).
                            ____________
                            Michael

                            d_a_dempsey
                            Joined: Feb 29 08
                            Posts: 3
                            Credit: 2,662,581
                            RAC: 3,481
                            Message 7919 - Posted 18 Sep 2008 13:47:09 UTC

                              Last modified: 18 Sep 2008 13:47:28 UTC

                              Tasks with issues:

                              35473384
                              35456755
                              ____________
                              David

                              Augustine
                              Joined: Mar 7 06
                              Posts: 36
                              Credit: 275,979
                              RAC: 25
                              Message 7921 - Posted 18 Sep 2008 14:54:39 UTC - in response to Message 7918.

                                Last modified: 18 Sep 2008 14:56:00 UTC

                                I also noticed, that your client seems not to report back the app-version.. so something must be fishy there ...:)

                                Unlike the other systems, it's running a beta client, 6.3.10. The WUs that succeeded do report the application version, but not those that failed. So maybe that's why.

                                Thanks.
                                ____________

                                Profile Ananas
                                Joined: Mar 7 06
                                Posts: 58
                                Credit: 752,054
                                RAC: 2
                                Message 8034 - Posted 1 Oct 2008 16:48:31 UTC

                                  Last modified: 1 Oct 2008 16:49:42 UTC

                                  I guess you're after a medal for the worst BOINC application?

                                  Each application task opened a GUI window asking me to install JAVA - great thing on an unattended cruncher - CPUs stuck for hours.

                                  After confirming the installation, they still crashed within no time, no reason given, just file transfer errors.

                                  It should at least have a warning ("Java required") behind the OptIn selection.

                                  Profile The Gas Giant
                                  Joined: Mar 7 06
                                  Posts: 1214
                                  Credit: 3,713,861
                                  RAC: 1,183
                                  Message 8039 - Posted 2 Oct 2008 3:10:30 UTC

                                    I believe Java is installed by the Malaria Control application. You just need to ensure you have 'install' rights.....

                                    I always love it when people go off!

                                    Augustine
                                    Joined: Mar 7 06
                                    Posts: 36
                                    Credit: 275,979
                                    RAC: 25
                                    Message 8041 - Posted 2 Oct 2008 4:10:46 UTC - in response to Message 8039.

                                      Last modified: 2 Oct 2008 4:11:25 UTC

                                      I believe Java is installed by the Malaria Control application. You just need to ensure you have 'install' rights...

                                      How does this play with BOINC 6.0's new protection scheme using its own users?

                                      TIA
                                      ____________

                                      Michael
                                      Volunteer moderator
                                      Project scientist
                                      Joined: May 5 06
                                      Posts: 79
                                      Credit: 494
                                      RAC: 0
                                      Message 8060 - Posted 3 Oct 2008 8:47:07 UTC - in response to Message 8034.

                                        I guess you're after a medal for the worst BOINC application?

                                        Each application task opened a GUI window asking me to install JAVA - great thing on an unattended cruncher - CPUs stuck for hours.

                                        After confirming the installation, they still crashed within no time, no reason given, just file transfer errors.

                                        It should at least have a warning ("Java required") behind the OptIn selection.


                                        Hi Ananas,

                                        I am sorry for the hassle you had with this application, and thank you for reporting it.
                                        No, normally this application does not require a Java installation, as a JRE comes with the application itself. (It is not really installed, just unzipped, meaning other apps won't find it, and it is gone after the slot is cleaned up, so I suspect there are no issues with BOINC's new user policy.) However, when the launcher for the Java app is executed and doesn't find a JRE in the place where it should be, it starts looking for a pre-installed one and tries to use that. Only then, if everything else fails, does it prompt you to install Java. If you install Java, that should do the trick, and in the future the optimizer app will use your newly installed JRE. This is not exactly what we wanted (it should use the JRE that comes with the app, not just any JRE), but it will work for you. It does not completely work for us, though, since other people might have that error too.

                                        This is an interesting error report, because looking into your results shows that your errors all come out as " -161 no output file find", and obviously a missing JRE was the problem. BUT unzipping the JRE by the wrapper application didn't result in an error (that would show up in the stderr). I cannot really make sense of it so far, but maybe this points in the right direction for resolving the -161 error: the JRE is unzipped, but not being used by the launcher. We have to look into this.





                                        ____________
                                        Michael

                                        Michael
                                        Volunteer moderator
                                        Project scientist
                                        Joined: May 5 06
                                        Posts: 79
                                        Credit: 494
                                        RAC: 0
                                        Message 8123 - Posted 10 Oct 2008 14:55:42 UTC - in response to Message 8060.

                                          Update: optimizer is going back to testing status, at least for a few days. It really looks as if certain client versions don't do well (some extremely badly) with this application. We are trying to figure out whether that is so, and which ones. Stay tuned.
                                          ____________
                                          Michael

                                          Thyme Lawn
                                          Joined: Jun 20 06
                                          Posts: 183
                                          Credit: 1,313,749
                                          RAC: 1,461
                                          Message 8230 - Posted 18 Oct 2008 7:58:53 UTC

                                            Last modified: 18 Oct 2008 8:01:14 UTC

                                            Looks like optimizer v1.55 has fixed the upper cannot equal lower problem with the opt_27_* tasks.

                                            Just returned my first one with the new version, only to get a too many total results validation failure (WU 13134127). The 7th task was sent out 4 hours after the 6th was returned. If the 7th task was created when the 6th was returned it shouldn't have been. If it already existed its state should surely have been set to server state Unsent and outcome Didn't need when the 6th was returned.

                                            The pair of tasks which ran longest have massively different claimed credit but the "stderr out" text between Application terminated and called boinc_finish is identical. This suggests the tasks would have validated successfully if the WU had been set up with maximum total results set to 7 (5 errors + 2 successful to meet the quorum) instead of 6.
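
The arithmetic behind the suggested limit of 7 can be sketched in a few lines of Python. The helper names are hypothetical and this is not BOINC server code; it only restates the "5 errors + 2 successful" reasoning above.

```python
def min_total_results_needed(error_results, quorum):
    """Smallest total-results limit that still leaves room for a full quorum
    after `error_results` tasks have already failed."""
    return error_results + quorum


def quorum_reachable(max_total_results, error_results, quorum):
    """True if enough non-error results can still be issued to meet the quorum."""
    return max_total_results - error_results >= quorum


if __name__ == "__main__":
    print(min_total_results_needed(5, 2))   # limit needed for this WU
    print(quorum_reachable(6, 5, 2))        # the limit the WU actually had
    print(quorum_reachable(7, 5, 2))        # the suggested limit
```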
                                            ____________
                                            "The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer

                                            Michael
                                            Volunteer moderator
                                            Project scientist
                                            Joined: May 5 06
                                            Posts: 79
                                            Credit: 494
                                            RAC: 0
                                            Message 8293 - Posted 23 Oct 2008 10:12:39 UTC - in response to Message 8230.

                                              Looks like optimizer v1.55 has fixed the upper cannot equal lower problem with the opt_27_* tasks.

                                              Yes, we fixed it. It was a bug in one of the Java libraries we used.



                                              Just returned my first one with the new version, only to get a too many total results validation failure (WU 13134127). The 7th task was sent out 4 hours after the 6th was returned. If the 7th task was created when the 6th was returned it shouldn't have been. If it already existed its state should surely have been set to server state Unsent and outcome Didn't need when the 6th was returned.


                                              That sounds to me like a server-side problem: probably the "transitioner" had not yet processed the newly received result, although 4 hours seems like a long time to me. There were some problems recently on the server side, causing the transitioner to be a bit behind. This should be fixed now (I will check back with Nick). In addition, I set up the maximum number of results, so this validation problem will not happen anymore.

                                              There are some problems remaining:
                                              First, we still have the -161 error (with nothing at all in the standard error), and have so far not been able to resolve it, but it seems to be machine dependent. The next app version will collect a bit more debugging information concerning this problem.

                                              Second, there is an error caused by a bug in client versions 6.2.14 upwards. This bug is fixed in version 6.2.18, so please update your clients if you are among those who get something like "can't get shmem().." in your stderr.
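
To check whether a given client falls into the affected range, a small Python sketch like the following could be used. It assumes that version strings are plain dotted numbers and that everything from 6.2.18 onwards counts as fixed; the post does not say anything about other release branches, so that part is an assumption.

```python
FIRST_AFFECTED = (6, 2, 14)  # from the post: bug present from 6.2.14 upwards
FIRST_FIXED = (6, 2, 18)     # from the post: fixed in 6.2.18


def parse_version(text):
    """Turn a dotted version string such as '6.2.14' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def is_affected(client_version):
    """True if the version lies in the assumed affected range [6.2.14, 6.2.18)."""
    return FIRST_AFFECTED <= parse_version(client_version) < FIRST_FIXED


if __name__ == "__main__":
    for version in ("6.2.13", "6.2.14", "6.2.17", "6.2.18"):
        print(version, is_affected(version))
```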




                                              ____________
                                              Michael

                                              Thyme Lawn
                                              Joined: Jun 20 06
                                              Posts: 183
                                              Credit: 1,313,749
                                              RAC: 1,461
                                              Message 8302 - Posted 23 Oct 2008 15:07:49 UTC - in response to Message 8293.

                                                I set up the max nr of results, so this validation problem will not happen anymore.

                                                Thanks Michael.
                                                ____________
                                                "The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer

                                                Michael
                                                Volunteer moderator
                                                Project scientist
                                                Joined: May 5 06
                                                Posts: 79
                                                Credit: 494
                                                RAC: 0
                                                Message 8386 - Posted 30 Oct 2008 15:55:11 UTC - in response to Message 8302.

                                                  Version 1.59 is out.
                                                  It avoids some redundancy in the calculations: if two persons differ only by age but are otherwise identical, we only have to calculate the older one and write down the intermediate results, then look up the results for the younger ones in that table. Something along those lines.

                                                  This will hopefully bring higher precision of results (or shorter calculation times); we are still testing what the benefit is.
                                                  However, it is clear that this strategy can only be applied in some special cases of models that you want to fit. This doesn't bother us at the moment, because all we want to do is fit exactly those special cases.
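
The reuse of intermediate results described above can be illustrated with a toy Python sketch. The update rule and the parameter names (`decay`, `exposure`) are invented for this example and have nothing to do with the real model; the point is only that one pass for the oldest person fills a table from which all younger, otherwise identical persons can be read off.

```python
def trajectory(params, max_age):
    """Compute a toy per-age state once, for the oldest person.
    table[a] is the state at age a."""
    table = [0.0]
    for _ in range(max_age):
        # Toy update rule, assumed purely for illustration.
        table.append(table[-1] * params["decay"] + params["exposure"])
    return table


def results_for_group(params, ages):
    """Results for persons who differ only by age: simulate the oldest,
    then look up everyone else in the table of intermediate results."""
    table = trajectory(params, max(ages))
    return {age: table[age] for age in ages}


if __name__ == "__main__":
    print(results_for_group({"decay": 0.5, "exposure": 1.0}, [1, 2, 3]))
```

This only works when a younger person's state is exactly an intermediate state of the older one, which matches the remark that the strategy applies only to special cases of models.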


                                                  In addition, it collects debugging information about the "JSmooth" launcher for the Java app (not in the standard error, though; it comes back to us as a second output file and does not appear in your web-based interface to the result database). With this information we hope to track down error -161, which is still around and still unsolved, though not very common.

                                                  cheers
                                                  ____________
                                                  Michael

                                                  Snugglebear
                                                  Joined: Jan 12 08
                                                  Posts: 2
                                                  Credit: 105,948
                                                  RAC: 0
                                                  Message 8398 - Posted 31 Oct 2008 23:02:00 UTC

                                                    Michael, has the issue with the infinite looping work units been resolved? I'm running a FreeBSD 7 x64 box with linux emulation that keeps hanging up on about 75% of the WUs sent my way. They will process through to 100% and then the system will go idle for upwards of 48 hours. Eventually some will be marked as finished and uploaded, but most will simply expire when the deadline passes. When I say idle, I mean it, too - the WU will sit there at 100%, CPU usage drops to zero, and boinc will not upload the result nor process the next unit. From playing around it appears that suspending and then resuming the WUs will cause the unit to resume processing for a few minutes and most of the time that allows them to complete successfully. Some, though, require four or five restarts in order to complete. Any assistance would be appreciated; manually restarting the jobs is tiresome.

                                                    Example WU that hung @ 100% and completed after restart:
                                                    wu_119_501_306221_0_1225446497_0

                                                    Sysinfo:
                                                    FreeBSD x.y.z 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #0: Mon Oct 6 19:41:19 PDT 2008 root@x.y.z:/usr/obj/usr/src/sys/GENERIC amd64
                                                    Boinc 6.2.14
                                                    Platform is x86_64-pc-freebsd primary, i686-pc-linux-gnu alternate

                                                    Michael
                                                    Volunteer moderator
                                                    Project scientist
                                                    Joined: May 5 06
                                                    Posts: 79
                                                    Credit: 494
                                                    RAC: 0
                                                    Message 8402 - Posted 1 Nov 2008 11:21:09 UTC - in response to Message 8398.



                                                      Example WU that hung @ 100% and completed after restart:
                                                      wu_119_501_306221_0_1225446497_0

                                                      Hey, that's a different application! :) Wrong thread... I'll pass this on to Nick.
                                                      ____________
                                                      Michael

                                                      Snugglebear
                                                      Joined: Jan 12 08
                                                      Posts: 2
                                                      Credit: 105,948
                                                      RAC: 0
                                                      Message 8406 - Posted 1 Nov 2008 21:33:10 UTC

                                                        Without seeing a nice architectural diagram it's all the same to me. Since yesterday there has been a string of good WUs that aren't hanging up. Averages have gone from 35/day to 80+/day.

                                                        Michael
                                                        Volunteer moderator
                                                        Project scientist
                                                        Joined: May 5 06
                                                        Posts: 79
                                                        Credit: 494
                                                        RAC: 0
                                                        Message 8420 - Posted 2 Nov 2008 14:51:20 UTC - in response to Message 8402.

                                                          Update: with version 1.60, the optimizer workunits now have an average duration of 2 hours and a maximum duration of 4 hours (compared to an average of 1 hour and a maximum of 2 hours before).

                                                          We are doing this to improve our fitting process, which is why we are trying different settings. We can't really test this in the "testing state", because we wouldn't have enough hosts for that. And even though one could call it "testing", we're not really testing whether the software runs - it does run. We are now just fiddling with the parameters to fine-tune it.

                                                          you betcha?
                                                          ____________
                                                          Michael

                                                          Thyme Lawn
                                                          Joined: Jun 20 06
                                                          Posts: 183
                                                          Credit: 1,313,749
                                                          RAC: 1,461
                                                          Message 8423 - Posted 2 Nov 2008 19:34:27 UTC

                                                            Last modified: 2 Nov 2008 19:35:27 UTC

                                                            Haven't had a version 1.60 task yet, but I've just spotted that checkpointing wasn't being done by version 1.59.

                                                            29-Oct-2008 23:53:13 [malariacontrol.net] [cpu_sched] Starting opt_14_-17470_6_200908137_0 (initial)
                                                            29-Oct-2008 23:53:13 [malariacontrol.net] Starting task opt_14_-17470_6_200908137_0 using optimizer version 155
                                                            29-Oct-2008 23:59:24 [malariacontrol.net] [checkpoint_debug] result opt_14_-17470_6_200908137_0 checkpointed
                                                            30-Oct-2008 00:01:01 [CPDN Beta] [checkpoint_debug] result hadrm3spinupagvf_jdcp_1920_160_10002370_1 checkpointed
                                                            30-Oct-2008 00:02:18 [malariacontrol.net] [checkpoint_debug] result opt_14_-17470_6_200908137_0 checkpointed
                                                            30-Oct-2008 00:05:18 [malariacontrol.net] [checkpoint_debug] result opt_14_-17470_6_200908137_0 checkpointed

                                                            ... cut ...

                                                            02/11/2008 17:10:29|malariacontrol.net|Starting opt_51_-14954_6_966686504_2
                                                            02/11/2008 17:10:29|malariacontrol.net|[cpu_sched] Starting opt_51_-14954_6_966686504_2 (initial)
                                                            02/11/2008 17:10:30|malariacontrol.net|Starting task opt_51_-14954_6_966686504_2 using optimizer version 159
                                                            02/11/2008 17:17:14|CPDN Beta|[checkpoint_debug] result hadrm3spinupagvf_jdcp_1920_160_10002370_1 checkpointed
                                                            02/11/2008 17:32:41|CPDN Beta|[checkpoint_debug] result hadrm3spinupagvf_jdcp_1920_160_10002370_1 checkpointed
                                                            02/11/2008 17:48:11|CPDN Beta|[checkpoint_debug] result hadrm3spinupagvf_jdcp_1920_160_10002370_1 checkpointed
                                                            02/11/2008 18:03:40|CPDN Beta|[checkpoint_debug] result hadrm3spinupagvf_jdcp_1920_160_10002370_1 checkpointed
                                                            02/11/2008 18:19:06|CPDN Beta|[checkpoint_debug] result hadrm3spinupagvf_jdcp_1920_160_10002370_1 checkpointed
                                                            02/11/2008 18:34:34|CPDN Beta|[checkpoint_debug] result hadrm3spinupagvf_jdcp_1920_160_10002370_1 checkpointed
                                                            02/11/2008 18:50:14|CPDN Beta|[checkpoint_debug] result hadrm3spinupagvf_jdcp_1920_160_10002370_1 checkpointed
                                                            02/11/2008 19:05:48|CPDN Beta|[checkpoint_debug] result hadrm3spinupagvf_jdcp_1920_160_10002370_1 checkpointed
                                                            02/11/2008 19:10:50|malariacontrol.net|Computation for task opt_51_-14954_6_966686504_2 finished
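If you want to run the same check on your own message log, here is a minimal Python sketch. It assumes the checkpoint_debug log flag is switched on (as in the excerpts above) and that lines look like one of the two layouts quoted here. The function name checkpoint_counts is made up for illustration and is not part of BOINC.

```python
import re

# Both log layouts quoted above contain these two message shapes.
STARTED = re.compile(r"Starting task (\S+) using (\S+) version (\d+)")
CHECKPOINTED = re.compile(r"\[checkpoint_debug\] result (\S+) checkpointed")


def checkpoint_counts(log_lines):
    """Return {task_name: (app_version_or_None, number_of_checkpoints)}."""
    versions = {}
    counts = {}
    for line in log_lines:
        started = STARTED.search(line)
        if started:
            task = started.group(1)
            versions[task] = int(started.group(3))
            counts.setdefault(task, 0)  # a task with no checkpoints still shows up
            continue
        checkpointed = CHECKPOINTED.search(line)
        if checkpointed:
            task = checkpointed.group(1)
            counts[task] = counts.get(task, 0) + 1
    return {task: (versions.get(task), n) for task, n in counts.items()}


if __name__ == "__main__":
    sample = [
        "02/11/2008 17:10:30|malariacontrol.net|Starting task opt_51_-14954_6_966686504_2 using optimizer version 159",
        "02/11/2008 17:17:14|CPDN Beta|[checkpoint_debug] result hadrm3spinupagvf_jdcp_1920_160_10002370_1 checkpointed",
    ]
    for task, (version, n) in checkpoint_counts(sample).items():
        print(task, "version", version, "checkpoints", n)
```

A task that started but reports zero checkpoints is the pattern I saw with version 1.59. Tasks whose start line is not in the excerpt get a version of None.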

                                                            ____________
                                                            "The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer

                                                            Thyme Lawn
                                                            Joined: Jun 20 06
                                                            Posts: 183
                                                            Credit: 1,313,749
                                                            RAC: 1,461
                                                            Message 8431 - Posted 3 Nov 2008 8:07:41 UTC - in response to Message 8423.

                                                              Haven't had a version 1.60 task yet, but I've just spotted that checkpointing wasn't being done by version 1.59.

                                                              I have now, and there's no checkpointing in version 1.60 either.
                                                              ____________
                                                              "The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer

                                                              Thyme Lawn
                                                              Joined: Jun 20 06
                                                              Posts: 183
                                                              Credit: 1,313,749
                                                              RAC: 1,461
                                                              Message 8440 - Posted 4 Nov 2008 7:56:15 UTC - in response to Message 8431.

                                                                I have now, and there's no checkpointing in version 1.60 either.

                                                                Correction. There is, it just seems to be very haphazard.

                                                                The first one I had ran for 68 minutes without a checkpoint.
                                                                03-Nov-2008 05:41:36 [malariacontrol.net] [cpu_sched] Starting opt_50_-30543_6_749280695_3 (initial)
                                                                03-Nov-2008 05:41:36 [malariacontrol.net] Starting task opt_50_-30543_6_749280695_3 using optimizer version 160
                                                                03-Nov-2008 06:49:49 [malariacontrol.net] Computation for task opt_50_-30543_6_749280695_3 finished

                                                                Second one ran for 150 minutes, checkpointed after 33 minutes and then every 16 minutes.
                                                                03-Nov-2008 11:27:26 [malariacontrol.net] [cpu_sched] Starting opt_23_-72292_6_905372596_1 (initial)
                                                                03-Nov-2008 11:27:27 [malariacontrol.net] Starting task opt_23_-72292_6_905372596_1 using optimizer version 160
                                                                03-Nov-2008 12:00:05 [malariacontrol.net] [checkpoint_debug] result opt_23_-72292_6_905372596_1 checkpointed
                                                                03-Nov-2008 12:16:13 [malariacontrol.net] [checkpoint_debug] result opt_23_-72292_6_905372596_1 checkpointed
                                                                03-Nov-2008 12:32:20 [malariacontrol.net] [checkpoint_debug] result opt_23_-72292_6_905372596_1 checkpointed
                                                                03-Nov-2008 12:32:20 [malariacontrol.net] [cpu_sched] Preempting opt_23_-72292_6_905372596_1 (left in memory)
                                                                03-Nov-2008 16:29:22 [malariacontrol.net] [cpu_sched] Resuming opt_23_-72292_6_905372596_1
                                                                03-Nov-2008 16:29:22 [malariacontrol.net] Resuming task opt_23_-72292_6_905372596_1 using optimizer version 160
                                                                03-Nov-2008 16:45:35 [malariacontrol.net] [checkpoint_debug] result opt_23_-72292_6_905372596_1 checkpointed
                                                                03-Nov-2008 17:01:53 [malariacontrol.net] [checkpoint_debug] result opt_23_-72292_6_905372596_1 checkpointed
                                                                03-Nov-2008 17:18:05 [malariacontrol.net] [checkpoint_debug] result opt_23_-72292_6_905372596_1 checkpointed
                                                                03-Nov-2008 17:34:13 [malariacontrol.net] [checkpoint_debug] result opt_23_-72292_6_905372596_1 checkpointed
                                                                03-Nov-2008 17:34:13 [malariacontrol.net] [cpu_sched] Preempting opt_23_-72292_6_905372596_1 (left in memory)
                                                                03-Nov-2008 20:47:09 [malariacontrol.net] [cpu_sched] Resuming opt_23_-72292_6_905372596_1
                                                                03-Nov-2008 20:47:09 [malariacontrol.net] Resuming task opt_23_-72292_6_905372596_1 using optimizer version 160
                                                                03-Nov-2008 21:02:47 [malariacontrol.net] [checkpoint_debug] result opt_23_-72292_6_905372596_1 checkpointed
                                                                03-Nov-2008 21:11:21 [malariacontrol.net] Computation for task opt_23_-72292_6_905372596_1 finished
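The "33 minutes, then every 16 minutes" figures can be reproduced from the log with a short Python sketch. It assumes the dd-Mon-yyyy layout shown above with English month abbreviations, and it measures each gap from the previous start, resume or checkpoint so that time spent preempted is not counted. The name checkpoint_gaps is hypothetical.

```python
import re
from datetime import datetime

LINE = re.compile(r"^\s*(\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2}) \[[^\]]*\] (.*)$")
RUNNING = re.compile(r"^(?:Starting|Resuming) task (\S+)")
CHECKPOINTED = re.compile(r"^\[checkpoint_debug\] result (\S+) checkpointed")
PREEMPTED = re.compile(r"^\[cpu_sched\] Preempting (\S+)")


def checkpoint_gaps(log_lines, task):
    """Seconds of running time before each checkpoint of `task`."""
    gaps = []
    last = None  # time of the last start/resume/checkpoint while running
    for line in log_lines:
        parsed = LINE.match(line)
        if not parsed:
            continue
        stamp = datetime.strptime(parsed.group(1), "%d-%b-%Y %H:%M:%S")
        message = parsed.group(2)
        running = RUNNING.match(message)
        checkpointed = CHECKPOINTED.match(message)
        preempted = PREEMPTED.match(message)
        if running and running.group(1) == task:
            last = stamp
        elif checkpointed and checkpointed.group(1) == task:
            if last is not None:
                gaps.append(int((stamp - last).total_seconds()))
            last = stamp
        elif preempted and preempted.group(1) == task:
            last = None  # clock stops until the task is resumed
    return gaps
```

Calling checkpoint_gaps(lines, "opt_23_-72292_6_905372596_1") on the excerpt above and dividing each value by 60 gives the intervals in minutes.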

                                                                ____________
                                                                "The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer

                                                                Michael
                                                                Volunteer moderator
                                                                Project scientist
                                                                Joined: May 5 06
                                                                Posts: 79
                                                                Credit: 494
                                                                RAC: 0
                                                                Message 8519 - Posted 11 Nov 2008 18:01:15 UTC - in response to Message 8440.

                                                                  Sorry for the late reply.
                                                                  I think this occurred because old workunits with different configurations were still in circulation when the new app version was already out.


                                                                  But recently we are getting errors like this:

                                                                  Exit status -177 (0xffffff4f) ERR_RSC_LIMIT_EXCEEDED

                                                                  CPU time 593.1563

                                                                  <message>
                                                                  Maximum disk usage exceeded
                                                                  </message>


                                                                  even though maximum disk usage is at
                                                                  <rsc_disk_bound>750000000</rsc_disk_bound>

                                                                  That is 750 megabytes, which is way above what our app needs.
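For anyone who wants to look at this on their own machine, here is a rough Python sketch that adds up the files under a directory and compares the total with the bound. It assumes the client measures roughly the files in the task's working (slot) directory, which I have not verified in detail. The directory you pass in and the names file_sizes and disk_report are hypothetical. Note that 750000000 bytes is 750 MB in decimal units, or about 715 MiB.

```python
import os

RSC_DISK_BOUND = 750_000_000  # bytes, the <rsc_disk_bound> value quoted above


def file_sizes(directory):
    """Return (size_in_bytes, path) pairs for regular files, largest first."""
    sizes = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            if os.path.islink(path):
                continue  # do not follow links out of the directory
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                continue  # file vanished while we were scanning
    return sorted(sizes, reverse=True)


def disk_report(directory, bound=RSC_DISK_BOUND):
    """Return (total_bytes, bound_exceeded, five_largest_files)."""
    sizes = file_sizes(directory)
    total = sum(size for size, _path in sizes)
    return total, total > bound, sizes[:5]
```

The list of largest files is the useful part: if one file keeps growing while the task runs, that file is the first suspect.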

                                                                  Has anybody out there experienced this error, or does anybody have deeper insights?

                                                                  thanks

                                                                  ____________
                                                                  Michael

                                                                  Thyme Lawn
                                                                  Send message
                                                                  Joined: Jun 20 06
                                                                  Posts: 183
                                                                  Credit: 1,313,749
                                                                  RAC: 1,461
                                                                  Message 8526 - Posted 11 Nov 2008 20:33:14 UTC - in response to Message 8519.

                                                                    Has anybody out there experienced this error, or does anybody have deeper insights?

                                                                    Not personally, but the likely cause has been reported here.
                                                                    ____________
                                                                    "The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer

                                                                    Michael
                                                                    Volunteer moderator
                                                                    Project scientist
                                                                    Joined: May 5 06
                                                                    Posts: 79
                                                                    Credit: 494
                                                                    RAC: 0
                                                                    Message 8528 - Posted 11 Nov 2008 21:34:51 UTC - in response to Message 8526.

                                                                      Has anybody out there experienced this error, or does anybody have deeper insights?

                                                                      Not personally, but the likely cause has been reported here.


                                                                      Ok thanks,
                                                                      I know how to fix it and will stop sending workunits of this app right now. The stdout.txt files should be no problem, because the BOINC client will remove them once the WUs have finished.

                                                                      Could anybody please post (some of) the contents of those files? They contain debugging information. Thanks!
                                                                      ____________
                                                                      Michael

                                                                      Michael
                                                                      Volunteer moderator
                                                                      Project scientist
                                                                      Joined: May 5 06
                                                                      Posts: 79
                                                                      Credit: 494
                                                                      RAC: 0
                                                                      Message 8538 - Posted 12 Nov 2008 8:28:12 UTC - in response to Message 8528.

                                                                        OK, the bug is fixed, version 1.61 is out, and we're slowly starting to send out workunits again.
                                                                        ____________
                                                                        Michael

                                                                        necronomicon
                                                                        Joined: Jul 2 06
                                                                        Posts: 10
                                                                        Credit: 1,625,591
                                                                        RAC: 0
                                                                        Message 8563 - Posted 13 Nov 2008 7:08:09 UTC - in response to Message 8528.

                                                                          Has anybody out there experienced this error, or does anybody have deeper insights?

                                                                          Not personally, but the likely cause has been reported here.


                                                                          Ok thanks,
                                                                          I know how to fix it and will stop sending workunits of this app right now. The stdout.txt files should be no problem, because the BOINC client will remove them once the WUs have finished.

                                                                          Could anybody please post (some of) the contents of those files? They contain debugging information. Thanks!


                                                                          Do you still want one? I made a RAR archive; I think it came out at 16 MB, so I could upload it.

                                                                          ____________


                                                                          Copyright © 2013 africa@home