I've tried it both ways: the client on my TR Pro with the server on the 10-core/20-thread i7-6950X, and the reverse. Regardless of which direction it's going, only one core on the machine set up as the server is used (I tested both directions just in case it was deciding the offload wasn't worthwhile). I checked affinity: it was set to all cores for the server process, and the client was set to 20 cores on the 6950X, in case that matters. Multiple server processes can't be started, but multiple connections from the same source can be made, so I tried that; all of them ran on the same core and it just slowed things down.
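
For anyone who wants to check the same thing programmatically rather than through Task Manager, something like this psutil sketch should work; the "uf6.exe" process name is just a guess, so substitute whatever the server actually shows up as:

```python
# Rough psutil sketch for checking (or resetting) a process's CPU affinity on Windows.
# "uf6.exe" is a guess at the server's process name; use whatever Task Manager shows.
import psutil

TARGET = "uf6.exe"

for proc in psutil.process_iter(["pid", "name"]):
    if (proc.info["name"] or "").lower() == TARGET:
        print(proc.info["pid"], proc.cpu_affinity())  # list of allowed logical CPUs
        # Uncomment to explicitly allow all logical CPUs again:
        # proc.cpu_affinity(list(range(psutil.cpu_count())))
```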

I also tried the various rendering methods, turning off anti-aliasing, etc. Watching the iteration count in the server monitor, it showed a large jump every so often, as if it occasionally ran on all cores and then switched back. There was no CPU use by anything but Windows processes on either machine when I tested, so the low-priority default shouldn't have mattered (and shouldn't affect threading anyway).

The old machine is around 1/3 as fast as the new one, so another ~30% of speed is pretty good considering how long larger images or animations can take. The network connection goes through an unmanaged switch plus the base router, and I'm getting 120 MB/s between the machines over SMB with sub-1 ms pings, so I doubt the network is the issue.
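
If anyone wants to rule out SMB overhead specifically, a bare TCP throughput test between the two boxes is easy to improvise; this is only a sketch, and the port number is arbitrary:

```python
# Minimal raw-TCP throughput check between two machines, bypassing SMB.
# Run "python nettest.py server" on one box and "python nettest.py <server-ip>" on the other.
import socket, sys, time

PORT = 50007            # arbitrary test port
CHUNK = 1 << 20         # 1 MiB per send
TOTAL = 512 * CHUNK     # send 512 MiB in total

def server():
    with socket.create_server(("", PORT)) as srv:
        conn, addr = srv.accept()
        with conn:
            received, start = 0, time.perf_counter()
            while True:
                data = conn.recv(CHUNK)
                if not data:
                    break
                received += len(data)
            secs = time.perf_counter() - start
            print(f"{received / secs / 1e6:.1f} MB/s from {addr[0]}")

def client(host):
    payload = b"\0" * CHUNK
    with socket.create_connection((host, PORT)) as conn:
        for _ in range(TOTAL // CHUNK):
            conn.sendall(payload)

if __name__ == "__main__":
    server() if sys.argv[1] == "server" else client(sys.argv[1])
```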

Are there any settings I might be missing or other things I should try, or is this a bug?

UF Server for Mac does not seem to use hyperthreading. For animations, I have to do old-school network rendering by manually splitting the render across my Macs; I can't run a render farm.

![642855d60a659.png](serve/attachment&path=642855d60a659.png)

Edit: I should have mentioned that the screenshot above is from a 7th-gen i5, not Apple Silicon/Rosetta. Below is an 8th-gen i7, and again UF Server is using only half the cores.

![6429ef3cbcc31.png](serve/attachment&path=6429ef3cbcc31.png)

http://www.youtube.com/fractalzooms

Honestly, those Mac graphs look like it's still running just one compute thread, with macOS moving that thread around between physical cores. Presumably the other six threads of the process are UI and networking, or it has three more compute threads incorrectly waiting on the current one to finish.

Windows tends to keep threads from a single process on one core, especially if some cores can clock higher than others, as on most reasonably modern processors, but it depends on power settings. Changing cooling from active to passive with the "balanced" power plan would probably give me a graph roughly like yours, with the scheduler trying to avoid hotspots on the die. Apple prioritized keeping noise levels at a minimum over active cooling as more-or-less hardcoded behavior on everything but the Mac Pro, so I imagine they did something similar.

On both operating systems, higher-utilization threads get placed on physical cores before logical/hyperthreaded ones, so that part makes sense too: in neither graph is utilization high enough to benefit from hyperthreading yet.
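
If you want to double-check what split the scheduler is actually working with, psutil reports it directly; just a quick sketch:

```python
# Quick check of the physical vs. logical (hyperthreaded) CPU split.
import psutil

physical = psutil.cpu_count(logical=False)  # may be None on some platforms
logical = psutil.cpu_count(logical=True)
print(f"physical cores: {physical}, logical CPUs: {logical}")
if physical and logical:
    print(f"hardware threads per core: {logical // physical}")
```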

What I find bizarre is that (at least on Windows) the server is just a shortcut to the main uf6 executable with the /Server command-line parameter and a different icon, so it doesn't seem like the render threading behavior should differ that much.
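
In other words you can presumably start the server yourself by passing that switch to the main executable; a quick sketch, with the install path being just a guess at the default location:

```python
# Launch the render server by starting the main executable with /Server, as described above.
# The path is only a guess at a default Windows install; adjust to your system.
import subprocess

UF6 = r"C:\Program Files\Ultra Fractal 6\uf6.exe"
subprocess.Popen([UF6, "/Server"])
```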

Super helpful explanation. Thanks. I didn’t know any of that.

I don't know how to confirm whether UF Server for Mac is a shortcut, so I don't think I can return the favor. I had figured UF Server was a standalone app that didn't yet have all the features of UF, and that that was why it wasn't using all cores, calculating motion blur, or handling multiple connections.

http://www.youtube.com/fractalzooms

It's currently a limitation of the network server that it only uses one thread per connection. It's on my wish list to improve this in a future release, but for now, the workaround is to make multiple connections to the same server.
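
Conceptually, one worker thread per connection looks like the generic sketch below (heavily simplified, not the real implementation), which is also why N connections can keep up to N cores busy while a single connection never uses more than one:

```python
# Generic one-thread-per-connection server sketch (simplified, not UF's actual code):
# each accepted connection gets exactly one worker thread, so one connection maps to
# one core, and N connections can use up to N cores.
import socket
import threading

def handle(conn, addr):
    with conn:
        while True:
            job = conn.recv(4096)
            if not job:
                break
            # ... per-connection work would happen here, single-threaded ...
            conn.sendall(job)  # placeholder echo

def serve(port=9000):  # example port only
    with socket.create_server(("", port)) as srv:
        while True:
            conn, addr = srv.accept()
            threading.Thread(target=handle, args=(conn, addr), daemon=True).start()

if __name__ == "__main__":
    serve()
```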

Ultra Fractal author

Hi Frederik,

Oddly, the last time I tried this, while I was using the 4-NUMA-node config, it didn't work and only used one server core regardless, as I wrote above. I've since switched back to single-node, and now it delegates work to the 20 server connections correctly.
The only caveats are that:

  • I need to set the client machine to use one fewer core than its total HT core count, or the network communication isn't given quite enough time and the server only hits 80% CPU. Dropping a client core doesn't affect speed much, since that core needed to handle interrupts from the onboard NIC anyway. Once my 40/56 converged adapters get here and I have a RoCE connection between the two computers with much better hardware offload, I'll give it another go. Onboard NICs are generally pretty terrible for compute offload since the CPU needs to handle too much; as usual, the driver "hardware offload" options are a lie. :D
  • Render speed drops considerably on the client (down to roughly 50%, with no other system activity) when UF6 isn't the foreground window. This is probably fixable by setting it to high priority (a rough sketch of that, plus the core-reservation tweak, follows this list), but it's a bit odd; usually Windows won't take that much CPU time from a process unless there's a lot of other activity. I didn't check the server machine while this was happening.
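
Here's a rough psutil sketch of both client-side tweaks, assuming Windows (psutil.HIGH_PRIORITY_CLASS is a Windows-only priority class) and that you look up UF6's PID in Task Manager:

```python
# Sketch of the two client-side tweaks above: raise UF6 to high priority so it keeps
# its CPU share when it isn't the foreground window, and drop one logical CPU from its
# affinity so NIC interrupt handling isn't starved. Windows-only; requires psutil.
import psutil

def tune(pid):
    proc = psutil.Process(pid)
    proc.nice(psutil.HIGH_PRIORITY_CLASS)   # bump the priority class
    cpus = list(range(psutil.cpu_count()))
    proc.cpu_affinity(cpus[:-1])            # leave one logical CPU free for the NIC
```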

I didn't test the reverse configuration either, partly because I won't be using it in practice but mainly because I didn't feel like adding 64 server connections in the network window.

Oddly, switching back away from "L3 as SRAT" (so the Threadripper shows up as one NUMA node in Windows again) broke the questionable fix I was using to make UF use all cores in the 4-node config. Since I have no clue how or why that was working, and it relied on using a SysInternals utility for something it wasn't meant to do, I'm not going to chase it any further.

So yeah, that solution works for now. As long as the network config sticks, I don't have a problem doing it that way.
