all linux jobs failed


nairb
Joined: Aug 10 08
Posts: 4
Credit: 78,954
RAC: 124
Message 13198 - Posted 5 Jul 2010 23:54:20 UTC

    All the jobs for my linux (fedora 8) have failed.
    Most with ......

    <core_client_version>6.10.17</core_client_version>
    <![CDATA[
    <message>
    process exited with code 255 (0xff, -1)
    </message>
    <stderr_txt>
    Warning: ESCaseManagement: decision case is unused (for uncomplicated tree)
    Exception: effectiveEIR is not finite: nan

    Looks like linux is no good.
    Nairb

    Bernie O
    Joined: Jun 16 10
    Posts: 1
    Credit: 153
    RAC: 0
    Message 13269 - Posted 15 Jul 2010 5:04:15 UTC - in response to Message 13198.

      Was this ever resolved? Running Linux Mint over here
      ____________
      www.cambodianboxing.com

      cisf
      Joined: Oct 21 09
      Posts: 8
      Credit: 25,168
      RAC: 0
      Message 13270 - Posted 15 Jul 2010 7:20:36 UTC

        Works very well on ubuntu64, so nairb's errors are probably not application related.

        If someone knows what error 255 means, that could help.

        hardy
        Volunteer moderator
        Project administrator
        Project developer
        Joined: Feb 18 09
        Posts: 142
        Credit: 56,936
        RAC: 6
        Message 13278 - Posted 16 Jul 2010 9:23:02 UTC - in response to Message 13270.

          255 is the default exit code for openmalaria errors (i.e. many causes). The error is:

          Exception: effectiveEIR is not finite: nan

          which doesn't have a very clear cause. Have you overclocked (floating-point errors may occur before integer errors)? It's not the only cause though. My machine (also 64-bit linux) usually works fine, but I've also had the occasional floating-point error.
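To illustrate how a single NaN can turn into "process exited with code 255 (0xff, -1)", here is a minimal Python sketch. It is not the openmalaria source: `SimulationError`, `require_finite`, `run` and `as_signed_byte` are hypothetical names made up for this example. The only things taken from the thread are the message text and the 255 exit code.

```python
import math
import sys

EXIT_ERROR = 255  # exit status quoted in the thread for openmalaria errors


class SimulationError(Exception):
    """Hypothetical exception type used only in this sketch."""


def require_finite(name, value):
    """Raise if value is NaN or infinite, echoing the message in the log."""
    if not math.isfinite(value):
        raise SimulationError(f"{name} is not finite: {value}")
    return value


def run(step_values):
    """Return a process exit status: 0 on success, 255 on any error."""
    try:
        for v in step_values:
            require_finite("effectiveEIR", v)
    except SimulationError as err:
        print(f"Exception: {err}", file=sys.stderr)
        return EXIT_ERROR
    return 0


def as_signed_byte(status):
    """Interpret an 8-bit exit status as a signed value (255 becomes -1)."""
    return status - 256 if status > 127 else status
```

A value such as `0.0 * float("inf")` evaluates to NaN without raising anything, so the problem only surfaces when some later check like `require_finite` looks at it. That is one reason the message says little about the root cause. `as_signed_byte(255)` shows why the same status is reported as both 0xff and -1.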

          nairb
          Joined: Aug 10 08
          Posts: 4
          Credit: 78,954
          RAC: 124
          Message 13282 - Posted 16 Jul 2010 11:45:21 UTC

            Thanks for the replies.
            None of the machines are overclocked. All are standard with a clean install of Fedora core 8. (1 is fedora 4 so may fail because of this)
            It's all standard software.
            All the machines are doing multiple projects (seti/rosetta/einstein/others) without any faults. All run 24*7

            I have just downloaded workunits to all the fedora core 8 machines again.
            All the machines failed with the same error as before.

            May be an issue with fc8.

            Ta
            Nairb

            cisf
            Joined: Oct 21 09
            Posts: 8
            Credit: 25,168
            RAC: 0
            Message 13286 - Posted 16 Jul 2010 20:26:36 UTC

              I had some time to burn, so I downloaded a live fc8 image and started it in a virtual machine (kernel 2.6.23.1-42.fc8, boinc 6.10.56). Did one WU, and it worked:

              http://www.malariacontrol.net/show_host_detail.php?hostid=164896

              will have to see if it gets validated or not ....

              so it doesn't look like an fc8 problem ...but i must say fc8 is very old .... if there is no reason to keep that version maybe an update would make the error go away

              hardy
              Volunteer moderator
              Project administrator
              Project developer
              Joined: Feb 18 09
              Posts: 142
              Credit: 56,936
              RAC: 6
              Message 13313 - Posted 19 Jul 2010 10:41:57 UTC - in response to Message 13286.

                so it doesn't look like an fc8 problem ...but i must say fc8 is very old .... if there is no reason to keep that version maybe an update would make the error go away

                Well, I was going to say that doesn't seem likely... but it is rather strange to have 100% of your computers failing these work-units.

                @nairb, would you perhaps be able to upgrade one of your FC8 systems and try running a few more work-units, and/or run a few on your FC4 system (http://www.malariacontrol.net/show_host_detail.php?hostid=108108) ?

                We're somewhat intrigued here... they are also all Athlon computers so it could even be the CPU and OS combination leading to these errors. (I presume though there are other causes of this effectiveEIR error.)

                Thanks for reporting.

                nairb
                Joined: Aug 10 08
                Posts: 4
                Credit: 78,954
                RAC: 124
                Message 13316 - Posted 19 Jul 2010 22:04:11 UTC

                  So..... I fired up my trusty dual pentium 3 machine with fc10 already installed.
                  This has done some malaria work before... Downloaded 1 workunit and it failed.

                  So I wondered if it's a dual cpu problem.

                  So I fired up an intel (pentium 3 coppermine 1ghz) single cpu fc10. Downloaded 1 workunit. And it Failed with the same message as the others.

                  So ... both fc10 and 8 fail and dual intel/amd & single intel all fail.
                  My windoz xp works fine.

                  But cisf has managed to do a w/u with the same kernel (2.6.23.1-42.fc8) as my fc8 machines.

                  I do have Ubuntu linux but no spare disk to put it on.

                  I really am sure that all these machines have done malaria work ok before.

                  Don't know what else to do. But they continue to do work for other projects fine.

                  Nairb

                  Krunchin-Keith [USA]
                  Volunteer moderator
                  Volunteer tester
                  Joined: Nov 10 05
                  Posts: 3319
                  Credit: 5,591,819
                  RAC: 422
                  Message 13329 - Posted 22 Jul 2010 10:47:35 UTC

                    I would suggest at least updating the boinc client on some of your machines (currently 5.10.28, 6.4.5, 6.6.29). This may not be the cause, but an old client version sometimes is. Try the newer 6.10.xx versions.

                    nairb
                    Joined: Aug 10 08
                    Posts: 4
                    Credit: 78,954
                    RAC: 124
                    Message 13331 - Posted 22 Jul 2010 18:33:23 UTC

                      Updated 2 machines (fc8 & fc10) to the latest linux x86 client version, 6.10.56.

                      Detached and re-attached machines. Downloaded everything fine......

                      All failed again.

                      Think I will have to stick to the 1 win XP machine.

                      Nairb

                      Burger
                      Joined: Aug 21 10
                      Posts: 6
                      Credit: 77
                      RAC: 0
                      Message 13582 - Posted 26 Aug 2010 6:20:08 UTC - in response to Message 13331.

                        Hello, I also have this problem with an AMD Athlon(tm) XP 2000+ [Family 6 Model 6 Stepping 2] and Ubuntu (newest version 10): http://www.malariacontrol.net/show_host_detail.php?hostid=167563

                        Is there a way to execute the work unit manually, without boinc started? Because on my machine the new WU only runs for 2 seconds, then flags this task as error and immediately gets new tasks. Maybe a library is missing or the binary is not found?

                        I have also installed windows XP on the same machine, and there the computation runs normally. But here I see that almost all (but not all) tasks are marked as "invalid" or "validation inconclusive": http://www.malariacontrol.net/results.php?userid=66742.

                        I have just added a second computer with more recent Intel Core Duo processor, and also one task is "inconclusive" http://www.malariacontrol.net/result.php?resultid=60185396.

                        So my conclusion is that the algorithm used in the computation is too unstable. If this continues, I will use the computers for different projects.


                        A wish: it would be easier if the host name was listed in the table "All tasks for USER".

                        Regards, Karsten
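The thread does not say how the science application is invoked outside BOINC, so the question above stays open. As a generic way to see the exit status and stderr of any program that dies within seconds, a small wrapper like the following Python sketch can help. `run_and_report` is a hypothetical helper, and `demo_command` is a stand-in child process, not the real application or its arguments.

```python
import subprocess
import sys


def run_and_report(command, timeout=60):
    """Run a command and return (exit status, captured stderr text)."""
    completed = subprocess.run(
        command,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return completed.returncode, completed.stderr


# Stand-in for the real application: a child Python process that writes
# one line to stderr and exits with status 255.
demo_command = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('Exception: demo failure\\n'); sys.exit(255)",
]
status, stderr_text = run_and_report(demo_command)
```

Replacing `demo_command` with the actual executable and whatever arguments it needs would show whether the failure is a missing library (usually reported by the loader on stderr) or an error raised by the program itself.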

                        Burger
                        Joined: Aug 21 10
                        Posts: 6
                        Credit: 77
                        RAC: 0
                        Message 13585 - Posted 26 Aug 2010 6:32:18 UTC - in response to Message 13582.

                          ... I forgot to say that the Intel computer "savitri" I mentioned works fine with other projects.

                          Burger
                          Joined: Aug 21 10
                          Posts: 6
                          Credit: 77
                          RAC: 0
                          Message 13624 - Posted 29 Aug 2010 19:50:53 UTC - in response to Message 13585.

                            I have in fact quit the project, since almost all the tasks are invalid, see http://www.malariacontrol.net/results.php?userid=66742.

                            mikey
                            Joined: Mar 23 07
                            Posts: 4701
                            Credit: 5,420,244
                            RAC: 402
                            Message 13627 - Posted 30 Aug 2010 10:29:51 UTC - in response to Message 13624.

                              I have in fact quit the project, since almost all the tasks are invalid, see http://www.malariacontrol.net/results.php?userid=66742.


                              Sorry to see you go, I hope you did not stop crunching altogether though!

                              P . P . L .
                              Joined: Aug 27 08
                              Posts: 56
                              Credit: 500,976
                              RAC: 0
                              Message 13655 - Posted 2 Sep 2010 8:43:19 UTC

                                Hi.

                                So are the linux apps still playing up? I was going to come and play with my new toy, an AMD x6 with Ubuntu 10.04 LTS x64 on it.

                                So tell me, is it good to go or not?


                                hardy
                                Volunteer moderator
                                Project administrator
                                Project developer
                                Joined: Feb 18 09
                                Posts: 142
                                Credit: 56,936
                                RAC: 6
                                Message 13660 - Posted 2 Sep 2010 9:32:31 UTC

                                  I'm sorry to hear that, Burger, but I guess it does make sense to move to another project if you're having no success here. I think we need to do a little more work on validation since slight differences in floating-point computation occur on many platforms, and in our simulations often end up causing quite big differences.

                                  I can't actually say whether yours will work, PPL. The majority of linux hosts are successfully returning results; I'm not sure how many of these get marked as invalid.

                                  Btw, if anyone else has a linux-related problem, could you start a new thread please? There are already three separate issues in this one!
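The point about slight floating-point differences between platforms can be shown in a few lines of Python. This is only an illustration, not the project's validator: `outputs_match` and its tolerances are assumptions chosen for the example.

```python
import math


def outputs_match(a, b, rel_tol=1e-9, abs_tol=1e-12):
    """Compare two result vectors element-wise within a tolerance."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
        for x, y in zip(a, b)
    )


# The same three numbers summed in two different orders, as might happen
# with different compilers or instruction sets.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

exact_equal = left_to_right == right_to_left
tolerant_equal = outputs_match([left_to_right], [right_to_left])
```

Here the two sums differ in the last bit, so the exact comparison fails while the tolerant one passes. A tolerance only helps while the differences stay small. In a long stochastic simulation a last-bit difference can change a branch or a random draw and grow into the "quite big differences" described above, which is why validation needs more work than picking a tolerance.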



                                  Copyright © 2013 africa@home