Posts by hardy |
|
11)
Message boards :
Unix/Linux :
error code 1
(Message 16391)
Posted 644 days ago by hardy I think this was a bad batch... we corrected it but neglected to respond until now. |
|
12)
Message boards :
Number crunching :
Display of the wu in the course of calculation ?
(Message 16390)
Posted 644 days ago by hardy You mean snazzy graphics or tables of what's going on? We track a lot of data in OpenMalaria, but most of it can only be represented in tables or simple plots, so it's not as interesting to look at. P.S. Sorry for the late reply. |
|
13)
Message boards :
Number crunching :
Zero credit
(Message 16380)
Posted 646 days ago by hardy I'm not confident that we'll solve the core problem soon, but to prevent this code kicking in for workunits currently going through, I've increased the cut-off to above the highest credit mentioned in the logs. That should solve the problem in the meantime (though it still leaves a bug somewhere in the server code). |
|
14)
Message boards :
Number crunching :
Zero credit
(Message 16373)
Posted 647 days ago by hardy In the server log, we see that your workunit had "too high credit" which should have been reduced. The actual credit set is not exactly zero, but 1.33959190672154e-25 (basically, nothing). We've just been looking through the code responsible for these calculations and cannot understand how the granted credit ends up being so small. Sorry, we'll have to investigate further, but right now have no idea what's going on! |
|
15)
Message boards :
Number crunching :
Zero credit
(Message 16353)
Posted 650 days ago by hardy Well, sorry about that. Off the top of my head I can't think why it might have been marked valid but had no credit granted. |
|
16)
Message boards :
Number crunching :
Running High Priority
(Message 16344)
Posted 651 days ago by hardy Yep, this is part of how our fitting works (see this post for some more info). Basically the server generates new parameter sets by genetic algorithm, creates a set of 61 workunits to test how well this parameter set fits our field data, and feeds the results back into the genetic algorithm to find the best fit. It's all automated, so the server can generate as many workunits as required (until it gets overloaded, but it's been doing pretty well recently). |
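The loop described above can be sketched roughly as follows. This is only an illustration of the general idea, not the project's server code: the loss function, the selection and mutation scheme, and every name here (`run_workunit`, `fitness`, `next_generation`, `fit`) are hypothetical. The only detail taken from the post is that each parameter set is tested with a batch of 61 workunits.

```python
import random

WORKUNITS_PER_SET = 61  # from the post: 61 workunits test one parameter set


def run_workunit(params, scenario):
    """Hypothetical stand-in for one workunit: a loss for one scenario."""
    target = scenario / WORKUNITS_PER_SET  # made-up "field data" value
    return sum((p - target) ** 2 for p in params)


def fitness(params):
    """Total loss over the whole batch of workunits (lower is better)."""
    return sum(run_workunit(params, s) for s in range(WORKUNITS_PER_SET))


def next_generation(scored, rng, size, sigma=0.1):
    """Keep the better half, refill by crossover plus Gaussian mutation."""
    ranked = sorted(scored, key=lambda t: t[0])
    survivors = [p for _, p in ranked[: max(2, size // 2)]]
    children = []
    while len(survivors) + len(children) < size:
        a, b = rng.sample(survivors, 2)
        children.append([rng.choice(pair) + rng.gauss(0, sigma)
                         for pair in zip(a, b)])
    return survivors + children


def fit(generations=20, size=10, n_params=3, seed=0):
    """Run the generate -> evaluate -> feed back loop; return (loss, params)."""
    rng = random.Random(seed)
    population = [[rng.uniform(0, 1) for _ in range(n_params)]
                  for _ in range(size)]
    best = None
    for _ in range(generations):
        # In the real system this step is what gets farmed out as workunits.
        scored = [(fitness(p), p) for p in population]
        gen_best = min(scored, key=lambda t: t[0])
        if best is None or gen_best[0] < best[0]:
            best = gen_best
        # Results are fed back to produce the next parameter sets.
        population = next_generation(scored, rng, size)
    return best
```

Calling `fit()` returns the lowest loss seen and the parameter set that produced it. Each generation here costs `size * 61` evaluations, which is why the server can always generate more work.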
|
17)
Message boards :
Number crunching :
Zero credit
(Message 16343)
Posted 651 days ago by hardy Sorry, our server deletes records of old workunits very quickly, so we've lost the record. Do you remember the name of the workunit? |
|
18)
Message boards :
Number crunching :
New Here with AMD 1090T HUGE Disk I/O problem...
(Message 16341)
Posted 651 days ago by hardy Mikey's right, it's pointless spending a lot of time writing checkpoints every 30 seconds or so if they're rarely going to be read. I don't think we can do much more about this though; the checkpoints are big because OpenMalaria (the application) has a lot of working data, and I'm not aware that we can control the checkpoint interval from our side. (Well, I guess we could violate the BOINC spec and not have OM write checkpoints when it's told to if another was written recently, but this would partially break BOINC's normal suspend-tasks-to-disk behaviour, so I don't think it would be much of an improvement.) Cheers guys! |
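For illustration only, the "skip a checkpoint if another was written recently" idea in the parenthesis might look something like the sketch below. This is not OpenMalaria or BOINC code; the class name, the method name, and the 300-second interval are all assumptions.

```python
class CheckpointThrottle:
    """Ignore checkpoint requests that arrive too soon after the last write."""

    def __init__(self, min_interval_s):
        self.min_interval_s = min_interval_s  # assumed minimum spacing
        self.last_write = None                # time of last checkpoint written

    def should_write(self, now, requested):
        """Return True if a checkpoint should be written at time `now`."""
        if not requested:
            return False
        too_soon = (self.last_write is not None
                    and now - self.last_write < self.min_interval_s)
        if too_soon:
            # This is the spec violation: the client asked, and we decline.
            return False
        self.last_write = now
        return True


throttle = CheckpointThrottle(min_interval_s=300)
for t in (0, 30, 60, 300):
    print(t, throttle.should_write(now=t, requested=True))
```

The drawback mentioned in the post shows up directly: any request that is declined means a suspend at that moment would lose the work done since the last checkpoint actually written.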
|
19)
Message boards :
Number crunching :
Maximum elapsed time exceeded
(Message 16287)
Posted 658 days ago by hardy So bad rsc_fpops_est figures bite again. Thanks for figuring this out. Well, the good news is that these CMreint workunits are nearly over (though boy, does processing 180000 results on the server take a long time)! The bad news is that we still don't have a better way of estimating workunit run-time. It shouldn't be impossible to do better than we currently do; it's just one of those things we've not found the time to do. Sorry that I can't offer any promises of getting to this; our software group is currently downsizing. |
|
20)
Message boards :
Number crunching :
Maximum elapsed time exceeded
(Message 16252)
Posted 662 days ago by hardy Well, I've increased the limit (will only affect new workunits). It seems a bit odd if you compare the host's FPOps/sec times run-time with the bound (was 3*10^14, now 1.5*10^15), but let's see whether this increase solves the problem before doing anything else. |
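As a rough worked example of that comparison: if the run-time limit is taken to be the bound divided by the host's floating-point speed, the two bounds translate into run-times as sketched below. The host speed of 2*10^9 FPOps/sec is an assumed figure, not one from the thread, and the exact speed value the client uses may differ.

```python
OLD_BOUND = 3e14     # from the post: previous bound, in floating-point ops
NEW_BOUND = 1.5e15   # from the post: new bound
HOST_FLOPS = 2e9     # assumption: an example host speed in FPOps/sec


def max_elapsed_seconds(fpops_bound, host_flops):
    """Longest run-time before the bound is hit, under the stated assumption."""
    return fpops_bound / host_flops


for label, bound in (("old", OLD_BOUND), ("new", NEW_BOUND)):
    seconds = max_elapsed_seconds(bound, HOST_FLOPS)
    print(f"{label} bound: {seconds:.0f} s ({seconds / 3600:.1f} h)")
```

With these numbers the old bound allows 150000 seconds (about 41.7 hours) and the new one 750000 seconds (about 208.3 hours), which gives a sense of why a task hitting the old limit on such a host looks odd.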