Errors Overnight


Message boards : Number crunching : Errors Overnight

The Knighty Ni
Joined: Aug 1 11
Posts: 10
Credit: 198,432
RAC: 0
Message 18957 - Posted 27 May 2012 8:49:11 UTC

    Hi

    I just installed a brand-new, fully formatted HD with a full system install. I did this because my system had become unstable, which caused all nine of the BOINC projects I participate in to keep failing.

    However, overnight, despite the full wipe and reinstall, I notice there are still some errors (listed below). I am curious to find out whether:
    1. there is something wrong with the BOINC client,
    2. the WUs would have failed in any case, or
    3. there is some tinkering I need to do to the fresh install.

    The reason I am very interested in these WUs is that they all appear to have run long enough to complete, compared with the successful results overnight. Also, when WUs run and don't error within the first few seconds, it usually means there is some reason for the failure other than an incorrect setup on my machine. Rapid failures usually mean there is a faulty setup on the host, which is what was happening to the majority of WUs before 27 May.

    BOINC 7.05.25 installed
    GPU (which should not affect this project) GTX 560Ti Driver version 285.66
    Is there any other system information you would need to identify where the fault could be?

    WUs affected
    http://www.malariacontrol.net/result.php?resultid=126092034
    Exception: initialKappa is invalid
    OpenMalaria: Domain error
    08:18:20 (5372): called boinc_finish
    Run time 1,649.53
    CPU time 1,636.94


    http://www.malariacontrol.net/result.php?resultid=126088423
    Exception: initialKappa is invalid
    OpenMalaria: Domain error
    07:25:58 (3120): called boinc_finish
    Run time 2,075.11
    CPU time 2,058.09


    http://www.malariacontrol.net/result.php?resultid=126088386
    Exception: initialKappa is invalid
    OpenMalaria: Result too large
    06:59:21 (6072): called boinc_finish
    Run time 1,590.45
    CPU time 1,579.16


    http://www.malariacontrol.net/result.php?resultid=126088221
    Exception: initialKappa is invalid
    OpenMalaria: Result too large
    07:01:10 (200): called boinc_finish
    Run time 2,512.83
    CPU time 2,486.48


    http://www.malariacontrol.net/result.php?resultid=126076491
    Exception: initialKappa is invalid
    OpenMalaria: Domain error
    08:24:15 (1892): called boinc_finish
    Run time 1,569.06
    CPU time 1,559.59


    http://www.malariacontrol.net/result.php?resultid=126068476
    Exception: initialKappa is invalid
    OpenMalaria: Domain error
    02:10:22 (208): called boinc_finish
    Run time 1,868.88
    CPU time 1,807.47


    http://www.malariacontrol.net/result.php?resultid=126068345
    Exception: initialKappa is invalid
    OpenMalaria: Result too large
    01:59:42 (2284): called boinc_finish
    Run time 1,758.39
    CPU time 1,692.22


    http://www.malariacontrol.net/result.php?resultid=126061994
    Exception: initialKappa is invalid
    OpenMalaria: Result too large
    05:49:53 (5852): called boinc_finish
    Run time 1,765.53
    CPU time 1,751.42
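A list like the one above can be tallied with a short script. This is only a minimal sketch: it assumes the results are pasted in exactly the layout used in this post (URL, exception line, run time, CPU time, blank line between entries), and the function name `parse_results` is made up for illustration. Only the first two entries are included to keep it short.

```python
import re
from collections import Counter

# Two entries copied from the list above. The layout is an assumption based
# on how the results were pasted into this post, not an official format.
REPORT = """\
http://www.malariacontrol.net/result.php?resultid=126092034
Exception: initialKappa is invalid
OpenMalaria: Domain error
08:18:20 (5372): called boinc_finish
Run time 1,649.53
CPU time 1,636.94

http://www.malariacontrol.net/result.php?resultid=126088423
Exception: initialKappa is invalid
OpenMalaria: Domain error
07:25:58 (3120): called boinc_finish
Run time 2,075.11
CPU time 2,058.09
"""


def parse_results(text):
    """Split a pasted report into one dict per result (hypothetical helper)."""
    results = []
    # Entries are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        entry = {}
        for line in block.splitlines():
            line = line.strip()
            if m := re.search(r"resultid=(\d+)", line):
                entry["result_id"] = int(m.group(1))
            elif line.startswith("Exception:"):
                entry["exception"] = line.split(":", 1)[1].strip()
            elif m := re.match(r"(Run|CPU) time ([\d,.]+)", line):
                # "1,649.53" -> 1649.53 seconds
                key = m.group(1).lower() + "_time"
                entry[key] = float(m.group(2).replace(",", ""))
        if entry:
            results.append(entry)
    return results


results = parse_results(REPORT)
total_run = sum(r["run_time"] for r in results)
by_exception = Counter(r["exception"] for r in results)

print(f"{len(results)} results, {total_run:.2f} s of run time lost")
print(dict(by_exception))
```

Grouping by the exception text makes it easy to see whether every failure shares one cause, as they do here.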
    ____________
    The Art of Flying is Throwing Yourself at the Ground and Missing. Douglas Adams ... Hitchhikers Guide to the Galaxy.

    Ananas
    Joined: Mar 7 06
    Posts: 58
    Credit: 752,054
    RAC: 0
    Message 18958 - Posted 27 May 2012 9:01:07 UTC - in response to Message 18957.

      The problem isn't on your side.

      The Knighty Ni
      Joined: Aug 1 11
      Posts: 10
      Credit: 198,432
      RAC: 0
      Message 18959 - Posted 27 May 2012 9:04:28 UTC

        Thanks Ananas

        That puts my mind at rest :)

        The Knighty Ni
        Joined: Aug 1 11
        Posts: 10
        Credit: 198,432
        RAC: 0
        Message 18978 - Posted 27 May 2012 16:25:21 UTC

          Another one has turned up the same as some of the last ones:

          http://www.malariacontrol.net/result.php?resultid=126138102
          Exception: initialKappa is invalid
          OpenMalaria: Result too large
          17:13:46 (4272): called boinc_finish
          Run time 1,749.56
          CPU time 1,696.48

          The Knighty Ni
          Joined: Aug 1 11
          Posts: 10
          Credit: 198,432
          RAC: 0
          Message 18988 - Posted 28 May 2012 11:58:41 UTC

            Another handful of faulty results.

            If I can identify which WUs are failing, and if it is possible, I'll set my preferences to stop those WUs coming onto my machine, as the wasted crunching time is starting to mount up. It is over 9 hours of run time in total at present.

            http://www.malariacontrol.net/result.php?resultid=126162548
            Exception: initialKappa is invalid
            OpenMalaria: Result too large
            23:01:30 (5588): called boinc_finish
            Run time 1,977.03
            CPU time 1,834.50


            http://www.malariacontrol.net/result.php?resultid=126163225
            Exception: initialKappa is invalid
            OpenMalaria: Domain error
            23:38:30 (5252): called boinc_finish
            Run time 2,218.48
            CPU time 2,063.81


            http://www.malariacontrol.net/result.php?resultid=126163347
            Exception: initialKappa is invalid
            OpenMalaria: Domain error
            00:31:05 (3528): called boinc_finish
            Run time 2,396.42
            CPU time 2,222.31


            http://www.malariacontrol.net/result.php?resultid=126165102
            Exception: initialKappa is invalid
            OpenMalaria: Result too large
            01:07:10 (2560): called boinc_finish
            Run time 2,163.20
            CPU time 2,038.84


            http://www.malariacontrol.net/result.php?resultid=126169800
            Exception: initialKappa is invalid
            OpenMalaria: Result too large
            01:12:16 (5112): called boinc_finish
            Run time 1,610.58
            CPU time 1,564.02


            http://www.malariacontrol.net/result.php?resultid=126189087
            Exception: initialKappa is invalid
            OpenMalaria: Domain error
            09:34:16 (3020): called boinc_finish
            Run time 2,011.05
            CPU time 1,800.59


            http://www.malariacontrol.net/result.php?resultid=126214595
            Exception: initialKappa is invalid
            OpenMalaria: Result too large
            10:46:22 (3940): called boinc_finish
            Run time 3,041.52
            CPU time 2,692.58


            http://www.malariacontrol.net/result.php?resultid=126216754
            Exception: initialKappa is invalid
            OpenMalaria: Result too large
            11:34:42 (4548): called boinc_finish
            Run time 1,840.69
            CPU time 1,755.44

            Neil
            Joined: Dec 30 09
            Posts: 20
            Credit: 1,898,340
            RAC: 1,533
            Message 18994 - Posted 28 May 2012 18:11:53 UTC - in response to Message 18988.



              Hi Nighty,

              My RAC (Recent Average Credit) used to be about 3500.

              Looks like the errors started around May 22, and my RAC started turning south around May 25. Now I'm down to 2945.

              I went to my Malariacontrol Account page (http://www.malariacontrol.net/home.php), clicked on Tasks, found a Work Unit that's listed as "Error while computing," and clicked on the "Work unit click for details" column for that erroneous Work Unit.

              That opened the webpage for that particular work unit (http://www.malariacontrol.net/workunit.php?wuid=69251915). It shows that one of my computers worked on that Work Unit, and so did 4 other computers, none of which are mine. Everyone's status is "Error while computing."

              I guess it's good to see that the problem is not in our computers, but something screwy with the work units (or their validation, or...).
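The check described above (compare your own outcome with everyone else's on the same work unit) can be written down as a small rule of thumb. This is a sketch under assumptions: the outcomes are typed in by hand from the work unit page, nothing is fetched from the server, and the function name `diagnose` and the "Completed and validated" status string are illustrative rather than taken from this thread.

```python
ERROR = "Error while computing"      # status text quoted in this thread
SUCCESS = "Completed and validated"  # assumed status text for a good result


def diagnose(outcomes, mine):
    """Guess whether a failure is host-side or work-unit-side.

    outcomes maps a host label to the outcome shown on the work unit page.
    mine is the label of your own host. Both are hypothetical inputs.
    """
    if outcomes[mine] != ERROR:
        return "no error on this host"
    others = [o for host, o in outcomes.items() if host != mine]
    if not others:
        return "unknown: no other hosts to compare against"
    if all(o == ERROR for o in others):
        return "likely a bad work unit: every host errored"
    if any(o == SUCCESS for o in others):
        return "possibly host-side: another host succeeded"
    return "unknown: other hosts have not finished"


# The situation described above: one of my hosts plus four others, all errors.
work_unit = {
    "my host": ERROR,
    "host A": ERROR,
    "host B": ERROR,
    "host C": ERROR,
    "host D": ERROR,
}
print(diagnose(work_unit, "my host"))
```

It is only a heuristic: a work unit where every host errors points away from any single machine, but it does not say what is wrong with the work unit itself.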

              Yeh, lots of wasted computing time. I hope Malariacontrol quickly recognizes what's going on and gets it straightened out. I'm never going to be able to take over the world if my Work Units keep getting thrown out!

              I'll check back here to see if you come up with a strategy for identifying and aborting faulty Work Units.

              I've been wondering for a few months why my RAC rollercoasters between 3000 and 4000. It seemed like too much variation. I'll bet these errors have been popping up for a while.

              Best luck,
              -neil-
              Member of team Flying Sams
              Scranton, PA, where it's gray and dreary even on the sunniest of days
              We have to work four times harder, because we'z only got Celerons


              The Knighty Ni
              Joined: Aug 1 11
              Posts: 10
              Credit: 198,432
              RAC: 0
              Message 18999 - Posted 28 May 2012 20:37:22 UTC

                Last modified: 28 May 2012 20:38:25 UTC

                Hi Neil

                Thanks for the support. If I manage to work out how to identify the bad WU's from the others I'll post here.

                However, from this post, it seems we need to avoid anything that is not openMalariaA:
                http://www.malariacontrol.net/forum_thread.php?id=1276

                I'm going to test the theory out before my team decides on the project of the month; otherwise there could be a bunch of big hitters wasting cycles.
                -------------------------------------------------------------------------------

                More of the Same and Some Sums

                Total run time lost = 35,473.81 seconds (9.85 hours)
                Total CPU time lost = 33,881.88 seconds (9.41 hours)
                Cost in electricity for the wasted time = about £0.54.

                May not seem like much in terms of cost. However, work it out over a year and that's £98.55 per annum burned for nothing.

                With 4 of my 6 CPUs on this project, that is about 2.5 hours per CPU over about 45 hours, which equals about 20% of the time wasted in processing these WUs.
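For anyone who wants to redo the sums with their own numbers, here is a minimal sketch of the arithmetic. The figures are the ones quoted in this post; the two-day period behind the £0.54 is an assumption, chosen because it reproduces the £98.55 per year. Note that the "about 20%" figure matches the total lost hours divided by 45 hours; measured against the full capacity of 4 CPUs over 45 hours, the share would be closer to 5.5%.

```python
SECONDS_PER_HOUR = 3600

run_time_s = 35_473.81   # total run time lost, from the post
cpu_time_s = 33_881.88   # total CPU time lost, from the post
cost_gbp = 0.54          # estimated electricity cost, from the post
window_hours = 45        # observation window, from the post
cpus_on_project = 4      # CPUs assigned to this project, from the post

run_hours = run_time_s / SECONDS_PER_HOUR
cpu_hours = cpu_time_s / SECONDS_PER_HOUR
hours_per_cpu = run_hours / cpus_on_project

# Assumption: the 0.54 was incurred over roughly two days.
assumed_days = 2
annual_cost = cost_gbp / assumed_days * 365

# Lost hours relative to 45 wall-clock hours (the "about 20%" figure).
share_of_window = run_hours / window_hours
# Lost hours relative to the capacity of 4 CPUs over the same window.
share_of_capacity = run_hours / (cpus_on_project * window_hours)

print(f"run time lost:   {run_hours:.2f} h")
print(f"CPU time lost:   {cpu_hours:.2f} h")
print(f"per CPU:         {hours_per_cpu:.2f} h")
print(f"annualised cost: {annual_cost:.2f} GBP")
print(f"share of window:   {share_of_window:.1%}")
print(f"share of capacity: {share_of_capacity:.1%}")
```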

                How many other volunteers are experiencing the same?
                What is the total cost across all volunteers?

                Anyway that's my moan out of the way.

                More of the same errors as earlier posts

                http://www.malariacontrol.net/result.php?resultid=126219772
                Exception: initialKappa is invalid
                OpenMalaria: Result too large
                14:49:38 (7096): called boinc_finish
                Run time 4,030.39
                CPU time 3,724.27


                http://www.malariacontrol.net/result.php?resultid=126253334
                Exception: initialKappa is invalid
                OpenMalaria: Result too large
                18:12:46 (6844): called boinc_finish
                Run time 1,675.50
                CPU time 1,641.94

                Edited to include URL tags :)

                michaelT
                Volunteer moderator
                Project administrator
                Project developer
                Project scientist
                Joined: Jul 20 10
                Posts: 48
                Credit: 16,359
                RAC: 0
                Message 19005 - Posted 29 May 2012 7:56:47 UTC - in response to Message 18999.

                  Hi guys,

                  Thanks for the info...

                  We're trying to find the source of the problem. It could be that one of the parameters which is automatically generated using a genetic algorithm is out of the boundaries, so the human infectivity is too low and workunits are crashing.
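As an illustration of the kind of failure described here, the sketch below checks an automatically generated parameter set against bounds before it would be turned into a workunit. Everything in it is hypothetical: the parameter names, the bounds and the function `out_of_bounds` are made up for the example and are not taken from OpenMalaria or from the project's generator.

```python
import math

# Hypothetical bounds. The real parameter names and limits are not given
# in this thread.
BOUNDS = {
    "param_a": (0.0, 1.0),
    "param_b": (1e-3, 50.0),
}


def out_of_bounds(candidate, bounds=BOUNDS):
    """Return the names of parameters that are missing, non-finite or out of range."""
    bad = []
    for name, (low, high) in bounds.items():
        value = candidate.get(name)
        if value is None or not math.isfinite(value) or not (low <= value <= high):
            bad.append(name)
    return bad


# Candidates as a genetic algorithm might propose them (made-up values).
candidates = [
    {"param_a": 0.5, "param_b": 2.0},
    {"param_a": 1.5, "param_b": float("nan")},
    {"param_a": 0.2},
]

for c in candidates:
    problems = out_of_bounds(c)
    if problems:
        print("reject:", c, "->", problems)
    else:
        print("accept:", c)
```

A check like this only catches values outside known limits; a combination of individually valid parameters can still lead to an invalid derived quantity during the run.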

                  The automatic generation of new workunits for those particular cases has been disabled for Branch B and Test until the problem is solved... The rest is still ok.

                  We will let you know as soon as the problem is solved. :)

                  Neil
                  Joined: Dec 30 09
                  Posts: 20
                  Credit: 1,898,340
                  RAC: 1,533
                  Message 19006 - Posted 29 May 2012 8:24:43 UTC - in response to Message 19005.

                    Hi Michael T,

                    Glad you're on it.

                    I just looked at all of my Tasks that were still available for looking at (http://www.malariacontrol.net/results.php?userid=57156)

                    Over the past three days, tasks running between 2,000 and 10,000 seconds completed. But in half the tasks, the credit granted was about 15(!). The other half of the tasks crunched as long, but ended in errors and no credit.

                    There hasn't been a single properly completed task for as far back as the records go.

                    Anyway, I hope that's old news, soon to be back to normal.

                    Thanks for your work. Waiting for your autobiography,

                    -neil-

                    m
                    Joined: May 29 08
                    Posts: 4
                    Credit: 130,311
                    RAC: 14
                    Message 19007 - Posted 29 May 2012 9:07:21 UTC - in response to Message 19005.

                      Thanks for the feedback. Had me worried for a bit!

                      John.

                      Neil
                      Joined: Dec 30 09
                      Posts: 20
                      Credit: 1,898,340
                      RAC: 1,533
                      Message 19008 - Posted 29 May 2012 10:06:04 UTC - in response to Message 19007.


                        Hi M.

                        Don't worry; be happy. Like Michael T said:

                        > It could be that one of the parameters which is automatically generated using a genetic algorithm is out of the boundaries, so the human infectivity is too low and workunits are crashing. The automatic generation of new workunits for those particular cases has been disabled for Branch B and Test until the problem is solved.

                        I'm somewhat worried about one thing, though. Michael also wrote:

                        > ... The rest is still ok.

                        I don't know what he's referring to. I can't find a single Task that has ended "ok" in the past few days.

                        I've stopped my few computers from downloading any more work, and aborted all the tasks they were working on and were waiting to work on.

                        I wish I could post an attachment to show the graphic of my Recent Average Credit tanking. Of course, the real victim is Malariacontrol progress.

                        Best luck,
                        -neil-

                        The Knighty Ni
                        Joined: Aug 1 11
                        Posts: 10
                        Credit: 198,432
                        RAC: 0
                        Message 19012 - Posted 29 May 2012 11:41:45 UTC

                          Thanks MichaelT

                          Look forward to when the issue is resolved.

                          In the meantime I'll keep munching up the WUs that are available.

                          Kind regards
                          The Knighty Ni.

                          P.S. If there is any information you need regarding my rig let me know and I'll PM it to you.

                          michaelT
                          Volunteer moderator
                          Project administrator
                          Project developer
                          Project scientist
                          Joined: Jul 20 10
                          Posts: 48
                          Credit: 16,359
                          RAC: 0
                          Message 19031 - Posted 30 May 2012 10:28:38 UTC - in response to Message 19008.

                            Hi Neil,

                            > I don't know what he's referring to. I can't find a single Task that has ended "ok" in the past few days.

                            To explain, we have several "runs" and each run generates workunits. Last Thursday, we created new runs and set their priority higher than that of the old runs, which were already there and working fine. Those runs are intended to find the best parameters for the mathematical models used in the application.

                            In the end, some of the new runs seem to have problems. We suppose that we may be in a bad region of the parameter space which couldn't be predicted. But we're still investigating.

                            We disabled the problematic runs yesterday, so now the old runs should generate correct workunits, but it could be that a few of the bad ones are still in the pipeline. Let me know if they keep crashing.

                            The Knighty Ni
                            Joined: Aug 1 11
                            Posts: 10
                            Credit: 198,432
                            RAC: 0
                            Message 19038 - Posted 30 May 2012 17:42:37 UTC

                              I'll be interested to see what the return on this one will be as it has already run for over 46 hours. By far the longest running WU from Malaria Control so far.

                              http://www.malariacontrol.net/result.php?resultid=126204401

                              It's still got another 1.5 hrs to run which will bring it up to a total of about 48 hrs runtime.

                              On a slower machine, say at 2.66 GHz, the same WU would take about 55.5 hours to complete.

                              The reality is I prefer longer-running WUs. The longest one ever was benchmarked at 1675 hours, from another project, on one of my former machines running at 2.2 GHz about 3 years ago. A nice and steady, trickle-up, credit-building WU :)

                              Oh yes. Thanks Michael. I haven't had any WU error since very early in the morning on the 29th. In total, about 31-32 WUs errored after running for either full term or almost full term. The last 2-3 I didn't post here as you are onto the problem.

                              I really appreciate what you are doing, and it makes me want to hang around longer to help. :)

                              The Knighty Ni
                              Joined: Aug 1 11
                              Posts: 10
                              Credit: 198,432
                              RAC: 0
                              Message 19132 - Posted 7 Jun 2012 15:44:01 UTC

                                Well, it's been about 10 days now and I'm certainly very pleased that no more errors have been produced.

                                I'm now running around 200+ WUs daily and the only error has been one the server cancelled, lol.

                                So I'm very happy about this, and it proves the rig is stable. Now maybe it's time to start O/Cing it again, back to where it was prior to the rig bugging out on all projects.

                                Tested all of the projects over the last few days with minimal errors all round.

                                Thanks for taking the time to look into this MichaelT and other project staff who have helped. :)


                                Copyright © 2013 africa@home