Unusual result (there IS a problem now!)


Ananas
Joined: Mar 7 06
Posts: 58
Credit: 752,054
RAC: 1
Message 18930 - Posted 26 May 2012 12:11:17 UTC

    wu_2952_417_742309_0_1337981624_0 is by far the longest running openMalaria result I have ever seen.

    67% after 14 hours, still using CPU time and there's still progress, I just hope that no rsc_fpops_bound setting will kill it.

    TheFiend
    Joined: Nov 5 09
    Posts: 3
    Credit: 338,993
    RAC: 2
    Message 18942 - Posted 26 May 2012 15:32:08 UTC - in response to Message 18930.

      I have one just like that..... currently 42% after 9.5 hours running on my 1090T

      Ananas
      Joined: Mar 7 06
      Posts: 58
      Credit: 752,054
      RAC: 1
      Message 18951 - Posted 26 May 2012 21:44:39 UTC - in response to Message 18930.

        Last modified: 26 May 2012 21:46:24 UTC

        No trouble with any runtime limits :-)

        CPU time 85,487.20 (not validated yet) - not a fast CPU, running at 2.27GHz + HT

        TheFiend
        Joined: Nov 5 09
        Posts: 3
        Credit: 338,993
        RAC: 2
        Message 18952 - Posted 26 May 2012 22:15:41 UTC - in response to Message 18951.

          My 1090T is clocked to 3.75GHz......

          Now 57% after 16 hours... All other WUs have been running as normal.

          Ananas
          Joined: Mar 7 06
          Posts: 58
          Credit: 752,054
          RAC: 1
          Message 18956 - Posted 27 May 2012 1:17:58 UTC

            Last modified: 27 May 2012 1:25:25 UTC

            I have just finished one more of those :

            wu_2952_317_742484_0_1337992814_0

            CPU time 67,505.73
            Validate state Valid
            Credit 0.00

            What kind of nasty crap is this again? A new misconception of the Berkeley despot? Some idiotic anti-cheating failure?

            Setting to NNW + aborting all unstarted results :-(

            TheFiend
            Joined: Nov 5 09
            Posts: 3
            Credit: 338,993
            RAC: 2
            Message 18965 - Posted 27 May 2012 11:23:14 UTC

              I also had a second one...... was 40% after 10 hours...

              Decided to abort both.

              :(

              mikey
              Joined: Mar 23 07
              Posts: 4700
              Credit: 5,420,171
              RAC: 405
              Message 18971 - Posted 27 May 2012 14:44:16 UTC - in response to Message 18965.

                I also had a second one...... was 40% after 10 hours...

                Decided to abort both.

                :(


                I had one and it went for almost 24 hours! I ONLY do the A units as this is an older laptop and the A units are supposed to be SMALLER!!!

                Sparky_140
                Joined: May 25 11
                Posts: 1
                Credit: 86,644
                RAC: 47
                Message 18987 - Posted 28 May 2012 11:41:06 UTC

                  Have a WU that's been plugging away for 15hrs but the estimated completion time keeps on going UP -- now estimated at an additional 24hrs and climbing ... is this reasonable (delivery deadline is May 31st -- unlikely that I'll make that if the remaining time continues to climb).
                  For the record it's wu_2953_743526_0_1338070808_0

                  Ananas
                  Joined: Mar 7 06
                  Posts: 58
                  Credit: 752,054
                  RAC: 1
                  Message 19000 - Posted 28 May 2012 21:07:25 UTC

                    Last modified: 28 May 2012 21:10:25 UTC

                    My longest has been 48 hours on a slightly OC'ed C2Q 9450, it's currently "inconclusive" (there seems to be a Linux vs. Windows issue but some misguided weirdo doesn't let me see the wingmen).

                    All other long ones have been valid and - no matter if there was a co-victim or I had it for my own - received 0.00 credits. A total of about 6 CPU days lost. Fortunately I had aborted the rest.

                    I guess that this is an anti-cheating algorithm that is definitely in the wrong place here.


                    p.s.: I had quite a hard time keeping my finger away from the "detach" button :-/

                    RandyC
                    Joined: Jun 23 06
                    Posts: 3166
                    Credit: 980,045
                    RAC: 900
                    Message 19022 - Posted 30 May 2012 0:40:25 UTC

                      Last modified: 30 May 2012 0:41:19 UTC

                      Since MCN for me is normally a set-and-forget project, I don't normally monitor my WUs. But after seeing a post in another thread I got to checking my systems... http://www.malariacontrol.net/workunit.php?wuid=69287156 ran for 141,555.30 CPU secs, no errors; validated; credit = ZERO!!!

                      michaelT
                      Volunteer moderator
                      Project administrator
                      Project developer
                      Project scientist
                      Joined: Jul 20 10
                      Posts: 47
                      Credit: 16,359
                      RAC: 0
                      Message 19028 - Posted 30 May 2012 9:49:30 UTC

                        Regarding the zero-credit issue:

                        It seems that this is due to a problem with the validator. There is a MAX_GRANTED_CREDIT parameter which should, in theory, grant MAX_GRANTED_CREDIT if WU_CREDIT > MAX_GRANTED_CREDIT (it prevents cheating through excessive credit requests), but in our case it granted 0 credit ... :(

                        Some of the workunits that were granted 0 credit have already been purged, but we managed to recover all the hosts and the average credits for them. So for those who didn't get credit before, it's fixed now.

                        We increased MAX_GRANTED_CREDIT, so this should not be a problem anymore. But let me know if it happens again.
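
For readers following along, here is a minimal sketch of the difference between the capping behaviour described above and the zero-credit behaviour that was actually observed. It is not the project's validator code: the function names and the cap value of 300 are hypothetical and chosen only for illustration.

```python
# Hypothetical illustration only; this is NOT the BOINC or malariacontrol.net validator.
# The cap of 300 below is an invented number, not the project's real setting.

def grant_credit_intended(claimed: float, max_granted: float) -> float:
    """Intended behaviour: a claim above the cap is reduced to the cap."""
    return min(claimed, max_granted)


def grant_credit_observed(claimed: float, max_granted: float) -> float:
    """Behaviour reported in this thread: a claim above the cap yields 0 credit."""
    return claimed if claimed <= max_granted else 0.0


if __name__ == "__main__":
    cap = 300.0  # assumed value for the example
    for claim in (100.0, 500.0):
        print(claim, grant_credit_intended(claim, cap), grant_credit_observed(claim, cap))
```

Under these assumptions, a normal workunit is unaffected either way, while an unusually long one (with a large claim) is the only kind that ends up with zero, which matches what was reported for the long-running results.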

                        jdvb
                        Joined: Apr 4 11
                        Posts: 19
                        Credit: 4,295,847
                        RAC: 5,777
                        Message 19037 - Posted 30 May 2012 17:06:55 UTC

                          Thank you for your reply!

                          But what about the extreme runtimes of 100+ hours?
                          WUs have to be returned within 3 days, which is not possible when they run this long.
                          The linked one I aborted after 106 hours, with progress at about 90%.

                          We only get 3 days to process tasks, so 5 days is stretching it too far.
                          Sadly I had to set NNW until either the running time is greatly decreased or the time allowed is greatly increased.

                          The maximum execution time on this system used to be around 2 hours; these run for 5 days (120 hours). That is about 60 times the usual running time!
                          Please either make tasks that once again finish within 3 hours, or increase the time allowed for processing to at least a month.

                          (A queue can build up when BOINC thinks the average time is around 2 hours but each task actually runs for 5 days!)
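
The arithmetic in this post can be checked with a short sketch. The helper names are made up for this example, and the projection assumes progress is roughly linear in time, which may not hold for these workunits (another poster saw the estimated time remaining keep climbing).

```python
# Rough sanity check of the numbers quoted above; helper names are hypothetical.

def slowdown_factor(usual_hours: float, actual_hours: float) -> float:
    """How many times longer a task runs compared with the usual runtime."""
    return actual_hours / usual_hours


def projected_total_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Projected total runtime, ASSUMING progress is linear in time."""
    return elapsed_hours / fraction_done


def fits_deadline(runtime_hours: float, deadline_days: float) -> bool:
    """True if the runtime fits inside the deadline, assuming 24/7 crunching."""
    return runtime_hours <= deadline_days * 24


if __name__ == "__main__":
    print(slowdown_factor(2, 120))           # 2-hour tasks now taking 120 hours
    print(projected_total_hours(106, 0.90))  # aborted at 106 hours and about 90%
    print(fits_deadline(120, 3))             # 120 hours against a 3-day deadline
```

Even with a machine running around the clock, a 3-day deadline only allows 72 hours, so a task that needs about 120 hours cannot be returned in time.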

                          The Knighty Ni
                          Joined: Aug 1 11
                          Posts: 10
                          Credit: 198,432
                          RAC: 0
                          Message 19039 - Posted 30 May 2012 17:54:07 UTC - in response to Message 19037.

                            @jdvb

                            (A queue can build up when BOINC thinks the average time is around 2 hours but each task actually runs for 5 days!)


                            I know exactly what you mean. There are other projects where this happens. The funny thing is to me it feels like they are spamming because eventually you become flooded with WU's which are all running at High Priority. Not a happy situation when you are trying to be fair to each project with your resources. Just doesn't give other projects a chance.

                            Usually manage this by setting No new tasks for a period of time depending on the WU run times for each project.

                            Don't you just hate having to Micro Manage in this way :)
                            ____________
                            The Art of Flying is Throwing Yourself at the Ground and Missing. Douglas Adams ... Hitchhikers Guide to the Galaxy.

                            jdvb
                            Joined: Apr 4 11
                            Posts: 19
                            Credit: 4,295,847
                            RAC: 5,777
                            Message 19042 - Posted 30 May 2012 18:58:00 UTC - in response to Message 19039.

                              Last modified: 30 May 2012 18:58:53 UTC

                              Don't you just hate having to Micro Manage in this way :)
                              Then do tell me, how do I manage a 5 day workunit to fit into a 3 day period?
                              Yes, I do find WU's that are simply ignored due to being turned in too late.

                              I generally only run one project on any PC.
                              No managing at all, except when tasks last longer than the maximum time allowed to execute.
                              The managing then involves aborting and moving to a different project, as I see no other option.

                              This is not about being flooded with WUs; a single WU is already too much on a multicore system.
                              I have stopped all malariacontrol work on all machines slower than i7s, as they get WUs that run longer than 50 hours.
                              I'm not going to waste time on WUs that I can't turn in before the deadline anyway. This needs fixing ASAP.

                              Strat
                              Joined: Apr 8 12
                              Posts: 6
                              Credit: 13,118
                              RAC: 191
                              Message 19088 - Posted 3 Jun 2012 10:45:18 UTC

                                So I guess these are the last jobs I do for malariacontrol.net, seeing how I got ZERO credit.


                                Name wu_2952_24_741957_0_1337959442_2
                                Workunit 69267532
                                Created 30 May 2012 23:24:01 UTC
                                Sent 30 May 2012 23:26:25 UTC
                                Received 2 Jun 2012 21:45:32 UTC
                                Server state Over
                                Outcome Computation error
                                Client state Compute error
                                Exit status -177 (0xffffff4f)
                                Computer ID 554761
                                Report deadline 3 Jun 2012 10:46:25 UTC
                                Run time 240,684.84
                                CPU time 226,948.60
                                Validate state Invalid
                                Credit 0.00
                                Application version openMalaria: A simulator of malaria epidemology and control (Branch A) v6.58


                                Name wu_2953_232_743346_0_1338056246_2
                                Workunit 69362442
                                Created 30 May 2012 5:54:35 UTC
                                Sent 30 May 2012 6:01:51 UTC
                                Received 2 Jun 2012 10:03:54 UTC
                                Server state Over
                                Outcome Computation error
                                Client state Compute error
                                Exit status -177 (0xffffff4f)
                                Computer ID 554761
                                Report deadline 2 Jun 2012 17:21:51 UTC
                                Run time 249,103.84
                                CPU time 230,974.40
                                Validate state Invalid
                                Credit 0.00
                                Application version openMalaria: A simulator of malaria epidemology and control (Branch A) v6.58

                                michaelT
                                Volunteer moderator
                                Project administrator
                                Project developer
                                Project scientist
                                Joined: Jul 20 10
                                Posts: 47
                                Credit: 16,359
                                RAC: 0
                                Message 19098 - Posted 4 Jun 2012 9:53:00 UTC - in response to Message 19088.

                                  First of all, Strat, this is a public forum, so please be gentle and don't be coarse (your previous post is now hidden).
                                  As for your credit: as you can see, the Validate state field of your workunits is Invalid, which means that you won't get any credit for those workunits.

                                  As I explained in a previous post, we're sorry for the mess-up that happened last week. We worked hard to fix the problems and to find out where it went wrong. We also granted credit to people who didn't get it because their credit was too high, and cancelled the workunits we identified as corrupted.

                                  So again, sorry for the mess, and thanks for your understanding.

                                  mikey
                                  Joined: Mar 23 07
                                  Posts: 4700
                                  Credit: 5,420,171
                                  RAC: 405
                                  Message 19099 - Posted 4 Jun 2012 10:26:23 UTC - in response to Message 19098.

                                    First of all, Strat, this is a public forum, so please be gentle and don't be coarse (your previous post is now hidden).
                                    As for your credit: as you can see, the Validate state field of your workunits is Invalid, which means that you won't get any credit for those workunits.

                                    As I explained in a previous post, we're sorry for the mess-up that happened last week. We worked hard to fix the problems and to find out where it went wrong. We also granted credit to people who didn't get it because their credit was too high, and cancelled the workunits we identified as corrupted.

                                    So again, sorry for the mess, and thanks for your understanding.


                                    MichaelT I too am STILL getting the REALLY long units, I aborted one that had run for over 24 hours and was still at 50% just this morning. My normal time frame is 1.5 hours or less. The units in question seem to be the ones starting with 2952.

                                    michaelT
                                    Volunteer moderator
                                    Project administrator
                                    Project developer
                                    Project scientist
                                    Joined: Jul 20 10
                                    Posts: 47
                                    Credit: 16,359
                                    RAC: 0
                                    Message 19103 - Posted 4 Jun 2012 12:53:37 UTC - in response to Message 19099.

                                      Yes, mikey, all the long workunits are the wu_2952_* and wu_2953_* ones. You can abort them; they have been cancelled.

                                      Strat
                                      Joined: Apr 8 12
                                      Posts: 6
                                      Credit: 13,118
                                      RAC: 191
                                      Message 19107 - Posted 4 Jun 2012 14:13:44 UTC - in response to Message 19098.

                                        Well, that's just fine by me, 'cause I'm outta here!

                                        mikey
                                        Joined: Mar 23 07
                                        Posts: 4700
                                        Credit: 5,420,171
                                        RAC: 405
                                        Message 19114 - Posted 5 Jun 2012 11:19:15 UTC - in response to Message 19103.

                                          Yes, mikey, all the long workunits are the wu_2952_* and wu_2953_* ones. You can abort them; they have been cancelled.


                                          THANK YOU for cancelling them! That means all units I now crunch are ones that will bring my rac back UP!!



                                          Copyright © 2013 africa@home