Was looking at IO writes...



Message boards : Number crunching : Was looking at IO writes...

paperdragon
Joined: Jun 22 06
Posts: 6
Credit: 59,381
RAC: 0
Message 1602 - Posted 16 Nov 2006 18:25:51 UTC

    Looking in Windows XP Task Manager, I noticed that, compared to other BOINC projects, Malaria was doing a lot of IO writes.

    The current work unit has done, so far, in 3 hours (92% complete), 41.7 million IO writes and 1.87 billion IO write bytes. I was just wondering, what is all the IO it is doing?
    ____________


    You like Myst? Uru Live returns! www.urulive.com

    ksba
    Joined: Jul 13 06
    Posts: 31
    Credit: 13,981,436
    RAC: 179
    Message 1603 - Posted 16 Nov 2006 18:36:24 UTC - in response to Message 1602.

      Last modified: 16 Nov 2006 18:37:20 UTC

      compared to other BOINC projects, Malaria was doing a lot of IO writes.


      It's the "same" thing I wrote about here:
      'The client generated too much traffic over the internal LAN.'


      ____________

      PaperDragon
      Joined: Jun 26 06
      Posts: 5
      Credit: 645,251
      RAC: 909
      Message 1644 - Posted 30 Nov 2006 16:17:54 UTC

        Last modified: 30 Nov 2006 16:19:36 UTC

        Curiosity got the better of me, so I downloaded 'Process Monitor', the replacement for Filemon/Regmon, to see what all the writes were.

        This is one of the most aggressive checkpoint writers I have seen. It is writing a checkpoint update every .0025 seconds. It normally writes out 28 bytes at a time, but every 10th record seems to be 144 bytes.

        Inserted below are 11 of the records being written:
        "40","9:59:33.0033661 AM","malariacontrol_...","WriteFile","...checkpoint0","Offset: 5,208,212, Length: 144"
        "41","9:59:33.0036398 AM","malariacontrol_...","WriteFile","..checkpoint0","Offset: 5,208,356, Length: 28"
        "42","9:59:33.0039058 AM","malariacontrol_...","WriteFile","..checkpoint0","Offset: 5,208,384, Length: 28"
        "43","9:59:33.0041720 AM","malariacontrol_...","WriteFile","..checkpoint0","Offset: 5,208,412, Length: 28"
        "44","9:59:33.0044377 AM","malariacontrol_...","WriteFile","..checkpoint0","Offset: 5,208,440, Length: 28"
        "45","9:59:33.0047092 AM","malariacontrol_...","WriteFile","..checkpoint0","Offset: 5,208,468, Length: 28"
        "46","9:59:33.0049758 AM","malariacontrol_...","WriteFile","..checkpoint0","Offset: 5,208,496, Length: 28"
        "47","9:59:33.0052414 AM","malariacontrol_...","WriteFile","..checkpoint0","Offset: 5,208,524, Length: 28"
        "48","9:59:33.0055121 AM","malariacontrol_...","WriteFile","..checkpoint0","Offset: 5,208,552, Length: 28"
        "49","9:59:33.0057789 AM","malariacontrol_...","WriteFile","..checkpoint0","Offset: 5,208,580, Length: 28"
        "50","9:59:33.0060471 AM","malariacontrol_...","WriteFile","..checkpoint0","Offset: 5,208,608, Length: 144"
        ____________
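
As a rough cross-check of the Process Monitor excerpt above, the spacing between records can be worked out from the timestamps. This is only a minimal Python sketch: `RECORDS` re-types the time-of-day and length fields of the eleven lines, and the helper name `seconds` is made up for the example.

```python
# (time of day, write length in bytes) re-typed from the eleven records above
RECORDS = [
    ("9:59:33.0033661", 144),
    ("9:59:33.0036398", 28),
    ("9:59:33.0039058", 28),
    ("9:59:33.0041720", 28),
    ("9:59:33.0044377", 28),
    ("9:59:33.0047092", 28),
    ("9:59:33.0049758", 28),
    ("9:59:33.0052414", 28),
    ("9:59:33.0055121", 28),
    ("9:59:33.0057789", 28),
    ("9:59:33.0060471", 144),
]


def seconds(stamp: str) -> float:
    """Convert an H:MM:SS.fffffff time of day into seconds since midnight."""
    hours, minutes, secs = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(secs)


times = [seconds(stamp) for stamp, _ in RECORDS]
# Gap between each record and the one before it.
gaps = [later - earlier for earlier, later in zip(times, times[1:])]
mean_gap = sum(gaps) / len(gaps)
writes_per_second = 1.0 / mean_gap

print(f"mean gap between writes: {mean_gap:.7f} s")
print(f"implied rate: {writes_per_second:.0f} writes per second")
```

On these eleven records the mean gap comes out at about 0.00027 s, which is shorter than the 0.0025 s quoted above and closer to the rate implied by 41.7 million writes in 3 hours (roughly 3,900 per second). Keep in mind that the monitoring tool adds overhead of its own, so the excerpt is only an approximation.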

        Evil-Dragon
        Joined: Jun 20 06
        Posts: 4
        Credit: 10,754
        RAC: 0
        Message 1655 - Posted 2 Dec 2006 17:05:47 UTC

          No wonder I notice a lot of disk activity on one of my machines. I'm getting a lot of IO writes as well (27 million, to be exact).

          Compared to BBC CCE: 293,000 IO writes.
          ____________

          [BAT]bigbee
          Joined: Mar 8 06
          Posts: 1
          Credit: 71,314
          RAC: 0
          Message 1698 - Posted 22 Dec 2006 18:21:11 UTC

            Hmmmmzzz this may explain the total crash of my PC's HD.
            I always was fond of this project... but this puts it into another perspective for me :-/
            ____________

            I crunch belgian style

            Redbull
            Joined: Mar 8 06
            Posts: 28
            Credit: 307,058
            RAC: 0
            Message 1708 - Posted 25 Dec 2006 19:19:05 UTC - in response to Message 1698.

              Last modified: 25 Dec 2006 19:30:13 UTC

              I don't know the programmers on the project, except that Maire and the rest of the team are obviously heavy dudes in the field of mathematical medical modelling and have managed to transfer the models to a working computer algorithm.

              I recently optimized an algorithm from a senior developer from 23 mins(!) down to a measly 7 seconds. 5,380,000 iterations.

              The flaws were simple things like recalculating the length on each loop iteration for fixed-length variable declarations, and not pre-allocating (yes, in .NET!) a large enough working space. Other yummy things like overuse of function calls, underuse of the local stack, and overly complex get routines on properties (also unnecessary calls to get routines) stood out like a dog's bollocks. (Sheesh, don't let me go on!)

              Maybe MCDN is calling a kinda special BOINC function which results in a write to the checkpoint file too frequently??

              However, the figures PaperDragon quotes are close enough to the 60 Hz frequency of BOINC's internal timer (which MCDN relies upon) to warrant a closer inspection.

              Quoting BOINC Wiki:


              The API implementation uses a timer (60Hz);
              the real-time clock is not available to applications.

              This timer is used for several purposes:
              * To tell the Science Application when to Checkpoint;
              * To regenerate the fraction done file
              * To refresh graphics


              Or is it using this timer interrupt inefficiently? (How is the timer bound or dispatched to the BOINC client by the main BOINC program??)

              Could it be writing a checkpoint *EVERY* time it gets a timing pulse, or is it skipping a few (but not enough) times before it writes? (After all, what is 30 seconds (a nice window) of lost computation time if writing less often saves speed in the OS?)

              Beware of the figures that tools like Regmon produce. They introduce delays of their own, and the figures they produce are often buffered and/or delayed. However, they do provide a fantastic insight into a running application.
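
To picture the difference between writing on every timer pulse and skipping pulses until a minimum interval has passed, here is a small Python sketch. It is illustrative only and is not BOINC or MCDN code: `CheckpointThrottle`, `count_checkpoints`, and the simulated 60 Hz loop are names and assumptions invented for this example. (In the BOINC API itself, an application asks `boinc_time_to_checkpoint()` whether it should checkpoint and reports back with `boinc_checkpoint_completed()`.)

```python
class CheckpointThrottle:
    """Allow a checkpoint only if min_interval_s has passed since the last one."""

    def __init__(self, min_interval_s: float) -> None:
        self.min_interval_s = min_interval_s
        self.last_checkpoint_s = None  # no checkpoint written yet

    def should_checkpoint(self, now_s: float) -> bool:
        if (self.last_checkpoint_s is None
                or now_s - self.last_checkpoint_s >= self.min_interval_s):
            self.last_checkpoint_s = now_s
            return True
        return False


def count_checkpoints(ticks_hz: int, duration_s: int, min_interval_s: float) -> int:
    """Simulate a timer firing ticks_hz times per second for duration_s seconds."""
    throttle = CheckpointThrottle(min_interval_s)
    written = 0
    for tick in range(ticks_hz * duration_s):
        now_s = tick / ticks_hz
        if throttle.should_checkpoint(now_s):
            written += 1  # a real application would serialize its state here
    return written


# Five simulated minutes of a 60 Hz timer.
every_tick = count_checkpoints(60, 300, 0.0)   # checkpoint on every pulse
throttled = count_checkpoints(60, 300, 30.0)   # at most one every 30 seconds

print(every_tick, throttled)
```

With a 30 second window the simulated run writes a handful of checkpoints instead of one per pulse, at the cost of losing at most 30 seconds of work after a crash.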

              A network sniffing session might produce some interesting profiles too..

              /me is eagerly awaiting possible sniffs at the source code.

              BOINC sounds like an interesting architecture. Rock on Finite State Machines!

              Merry Christmas All!

              The Gas Giant
              Joined: Mar 7 06
              Posts: 1214
              Credit: 3,729,241
              RAC: 278
              Message 1717 - Posted 28 Dec 2006 11:14:29 UTC

                Last modified: 28 Dec 2006 11:20:01 UTC

                I'm seeing 57 million I/O writes in 1hr40min per app. That's 68 million I/O writes per hr on my hyperthreaded 3.2GHz machine. A little too many for mine!

                Come on, Maire, can you do something / anything about the number of I/O writes?

                Live long and BOINC.

                ____________
                Paul
                (S@H1 8888)

                maire
                Volunteer moderator
                Project administrator
                Project developer
                Project scientist
                Joined: Nov 7 05
                Posts: 439
                Credit: 118,258
                RAC: 0
                Message 1740 - Posted 3 Jan 2007 16:17:01 UTC

                  We're all back at work now, and ready to give this problem a closer look. Thank you for your input! I'll let you know as soon as we make progress or need further info.
                  Thanks
                  Nick
                  ____________
                  Nicolas Maire
                  Swiss Tropical and Public Health Institute
                  http://www.swisstph.ch

                  Jean-David Beyer
                  Joined: Jan 5 07
                  Posts: 18
                  Credit: 176,490
                  RAC: 98
                  Message 1815 - Posted 7 Jan 2007 5:32:53 UTC - in response to Message 1740.

                    We're all back at work now, and ready to give this problem a closer look. Thank you for your input! I'll let you know as soon as we make progress or need further info.
                    Thanks
                    Nick

                    My BOINC stuff all occurs in a partition used by no other applications. Looking at IO rates for this partition gives the following, using iostat in Linux:

                    Time: 00:23:13
                    avg-cpu: %user %nice %sys %iowait %idle
                    0.91 98.47 0.61 0.00 0.01

                    Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
                    hda11 0.00 68.15 0.00 5.07 0.00 585.73 0.00 292.87 115.61 0.18 35.72 0.79 0.40

                    Time: 00:24:13
                    avg-cpu: %user %nice %sys %iowait %idle
                    1.54 97.54 0.89 0.00 0.03

                    Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
                    hda11 0.00 54.59 0.00 4.23 0.00 470.59 0.00 235.29 111.18 0.17 39.72 0.87 0.37

                    Time: 00:25:13
                    avg-cpu: %user %nice %sys %iowait %idle
                    1.71 96.27 1.99 0.00 0.03

                    Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
                    hda11 0.00 14.47 0.00 3.35 0.00 142.80 0.00 71.40 42.63 0.02 6.57 0.35 0.12

                    Time: 00:26:13
                    avg-cpu: %user %nice %sys %iowait %idle
                    2.49 93.46 3.63 0.00 0.42

                    Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
                    hda11 0.00 93.67 0.00 7.12 0.00 806.53 0.00 403.27 113.33 0.38 52.90 0.80 0.57

                    That is essentially no reading and around 200k bytes/second writing. For me, that is not a big deal, but it could be for others. I have 8 GBytes RAM, so those writes may just sit in the buffers for a while and not write to disk all that often. The drive for that partition is 7200 rpm EIDE with 8 Megabyte buffer.
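
For reference, the four one-minute iostat samples above can be averaged directly. This is a small Python sketch; the values are copied from the `wsec/s` and `wkB/s` columns, the 512-byte sector size is the unit iostat uses for its sector columns, and the variable names are made up for the example.

```python
SECTOR_BYTES = 512  # iostat reports the sector columns in 512-byte sectors

# Copied from the four hda11 samples above.
WSEC_PER_S = [585.73, 470.59, 142.80, 806.53]
WKB_PER_S = [292.87, 235.29, 71.40, 403.27]

# Converting sectors/s to kB/s should reproduce the wkB/s column.
derived_kb_per_s = [sectors * SECTOR_BYTES / 1024 for sectors in WSEC_PER_S]

mean_kb_per_s = sum(WKB_PER_S) / len(WKB_PER_S)
mb_per_hour = mean_kb_per_s * 3600 / 1024

print(f"mean write rate: {mean_kb_per_s:.1f} kB/s")
print(f"about {mb_per_hour:.0f} MB written per hour at that rate")
```

The mean works out to roughly 250 kB/s, in line with the "around 200k bytes/second" estimate, or somewhat under 900 MB per hour if the rate were sustained.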
                    ____________

                    Profile maire
                    Volunteer moderator
                    Project administrator
                    Project developer
                    Project scientist
                    Joined: Nov 7 05
                    Posts: 439
                    Credit: 118,258
                    RAC: 0
                    Message 1887 - Posted 12 Jan 2007 16:28:38 UTC - in response to Message 1602.

                      The current work unit has done, so far, in 3 hours (92% complete), 41.7 million IO writes and 1.87 billion IO write bytes. I was just wondering, what is all the IO it is doing?


                      Checkpointing in our case means writing the complete status of the simulation model to disk; with the current workunits, that is usually between 12 and 25 MB of data. Given that your host seems to write about 10 checkpoints per hour, that would be up to 250 MB (possibly even a bit more) of data written per hour. Reducing the frequency at which you allow BOINC to write to disk could help.
                      The new version 545 contains a few amendments to the IO buffering during checkpoint writes. I would be interested to know if this makes a difference. Could you give us some feedback?
                      Thanks a lot
                      Nick
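
To illustrate why buffering matters when a checkpoint is written as many small records, here is a hedged Python sketch. It is not the project's code (the thread does not show what changed in version 545): `CountingSink` is a made-up in-memory stand-in for a file that counts how many write calls actually reach it, and the 28-byte record size is borrowed from the Process Monitor trace earlier in the thread.

```python
import io


class CountingSink(io.RawIOBase):
    """In-memory stand-in for a file that counts low-level write calls."""

    def __init__(self) -> None:
        super().__init__()
        self.calls = 0
        self.total_bytes = 0

    def writable(self) -> bool:
        return True

    def write(self, data) -> int:
        self.calls += 1
        self.total_bytes += len(data)
        return len(data)


def low_level_writes(record_count: int, record_size: int, buffer_size: int) -> int:
    """Write record_count records and return how many calls reached the sink."""
    sink = CountingSink()
    record = b"\0" * record_size
    if buffer_size == 0:
        for _ in range(record_count):
            sink.write(record)      # every record goes straight to the sink
    else:
        writer = io.BufferedWriter(sink, buffer_size=buffer_size)
        for _ in range(record_count):
            writer.write(record)    # records accumulate in memory first
        writer.flush()              # push whatever is left in the buffer
    return sink.calls


unbuffered = low_level_writes(10_000, 28, 0)
buffered = low_level_writes(10_000, 28, 64 * 1024)

print(f"unbuffered: {unbuffered} write calls, buffered: {buffered} write calls")
```

The same bytes are written in both cases; only the number of calls that reach the underlying file changes, which is the figure Task Manager reports as IO writes.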
                      ____________
                      Nicolas Maire
                      Swiss Tropical and Public Health Institute
                      http://www.swisstph.ch

                      paperdragon
                      Joined: Jun 22 06
                      Posts: 6
                      Credit: 59,381
                      RAC: 0
                      Message 1891 - Posted 12 Jan 2007 23:32:22 UTC

                        I have BOINC set to do disk writes every 5 minutes. Watching MalariaControl, it is now writing one checkpoint file out every 5 minutes, instead of the 400 writes a second it was doing previously.

                        Looks like it is writing around 300MB an hour.
                        So at 5:03 it wrote a 21MB file, then at 5:08 wrote 26MB, 5:13 wrote 26MB.

                        Much easier now on system resources.
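
The hourly volume follows from the checkpoint size and the write interval. A quick Python sketch of that arithmetic, using the 26 MB every 5 minutes observed here and Nick's earlier estimate of 12 to 25 MB at about 10 checkpoints per hour (the helper name `mb_per_hour` is made up for the example):

```python
def mb_per_hour(checkpoint_mb: float, interval_minutes: float) -> float:
    """Data written per hour if one checkpoint is written every interval."""
    return checkpoint_mb * (60.0 / interval_minutes)


observed = mb_per_hour(26, 5)      # 26 MB checkpoint every 5 minutes
estimate_low = mb_per_hour(12, 6)  # 10 checkpoints per hour, 12 MB each
estimate_high = mb_per_hour(25, 6) # 10 checkpoints per hour, 25 MB each

print(observed, estimate_low, estimate_high)
```

That gives a little over 300 MB per hour for the observed case and 120 to 250 MB per hour for the estimate, so both figures in the thread are consistent with each other.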
                        ____________


                        You like Myst? Uru Live returns! www.urulive.com

                        KAMasud
                        Joined: Jan 7 07
                        Posts: 12
                        Credit: 18,733
                        RAC: 0
                        Message 1905 - Posted 14 Jan 2007 17:42:46 UTC


                          :-) Things have improved, thanks :-)
                          ____________

                          maire
                          Volunteer moderator
                          Project administrator
                          Project developer
                          Project scientist
                          Joined: Nov 7 05
                          Posts: 439
                          Credit: 118,258
                          RAC: 0
                          Message 1929 - Posted 16 Jan 2007 16:04:04 UTC - in response to Message 1905.


                            :-) Things have improved, thanks :-)

                            That's good to hear, thanks!
                            Along with that, the failure rate seems to have dropped a little more with 545.
                            Nick
                            ____________
                            Nicolas Maire
                            Swiss Tropical and Public Health Institute
                            http://www.swisstph.ch



                            Copyright © 2013 africa@home