From: John Stone (johns_at_ks.uiuc.edu)
Date: Tue Jan 31 2023 - 22:30:35 CST

Hi,
  I'm late seeing this due to being away from email for a bit.

If the issues you encounter occur with multi-GPU, IMHO, one of the first
things to check is whether IOMMU is enabled or disabled in your Linux
kernel, as that has been a source of this kind of problem at some points
in the past.
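
For anyone following along, here is a minimal sketch of one way to check
the IOMMU state on a Linux box. It assumes an x86 system and only looks at
the kernel command line and /sys/kernel/iommu_groups. The helper name
classify_iommu_cmdline is made up for illustration and is not part of any
tool. A result of "unspecified" just means no IOMMU option was passed, so
the kernel's built-in default applies.

```shell
#!/bin/sh
# Hypothetical helper: classify the IOMMU-related options found on a
# kernel command line passed as a single string argument.
classify_iommu_cmdline() {
    case " $1 " in
        *" iommu=off "*|*" intel_iommu=off "*|*" amd_iommu=off "*)
            echo "disabled" ;;
        *" iommu=pt "*)
            echo "passthrough" ;;
        *" intel_iommu=on "*)
            echo "enabled" ;;
        *)
            echo "unspecified" ;;
    esac
}

# Report what the running kernel was booted with, if /proc is available.
if [ -r /proc/cmdline ]; then
    echo "kernel cmdline: $(classify_iommu_cmdline "$(cat /proc/cmdline)")"
fi

# A populated iommu_groups directory indicates the IOMMU is active.
if [ -d /sys/kernel/iommu_groups ]; then
    echo "active IOMMU groups: $(ls /sys/kernel/iommu_groups | wc -l)"
fi
```

If the group count is zero, the IOMMU is most likely not active, whatever
the command line says.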

I'm assuming that the GPUs are not necessarily
NVLink-connected, but this can be queried like this:

% nvidia-smi topo -m
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      NV4     0-63            N/A
GPU1    NV4      X      0-63            N/A

Legend:

  X = Self
  SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX = Connection traversing at most a single PCIe bridge
  NV# = Connection traversing a bonded set of # NVLinks

Please share the results of that query and let me know what it says.
The above output is from one of my test machines with a pair of
NVLink-connected A6000 GPUs, for example.
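
If it helps to script the check, here is a rough sketch that scans the
topology matrix for NV# entries. The helper name has_nvlink is made up for
illustration. It assumes the matrix rows start with GPU<n>, as in the
output above, and it stops reading at the legend.

```shell
#!/bin/sh
# Hypothetical helper: read "nvidia-smi topo -m" output on stdin and
# succeed (exit 0) if any GPU row contains an NV<number> entry.
has_nvlink() {
    awk '
        /^Legend/ { exit }                  # skip the legend text
        $1 ~ /^GPU[0-9]+$/ {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^NV[0-9]+$/) found = 1
        }
        END { exit found ? 0 : 1 }
    '
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    if nvidia-smi topo -m | has_nvlink; then
        echo "NVLink reported between at least one GPU pair"
    else
        echo "no NVLink reported in the topology matrix"
    fi
fi
```

Running "nvidia-smi nvlink --status" as well shows the per-link state,
which is useful when a bridge is physically installed but the links are
not active.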

Best,
  John Stone

On Tue, Jan 31, 2023 at 02:48:10PM -0800, Brendan Dennis wrote:
> Hi Josh,
> The problem we are experiencing is that, on new systems with multiple GPUs
> with compute capability 8.6 (which requires CUDA 11.1+), rendering with
> TachyonL-OptiX produces a checkered pattern across the output. If we then
> use the same exact compilation of VMD 1.9.4a57 (CUDA 11.2, OptiX 6.5.0) on
> systems with older GPUs, we do not have this checkered pattern problem in
> the output. So, it's not so much that we're having problems with OptiX 6.5
> specifically, but rather that we're having problems with VMD rendering on
> SM 8.6 GPUs. Although I can't determine for sure that OptiX 6.5.0 is the
> problem-causing part of this, the fact the OptiX release notes only start
> mentioning compatibility with CUDA 11.1+ in the 7.2.0 release is what made
> me think this might be an OptiX version issue.
> However, I had some further troubleshooting ideas after thinking things
> through while reading your reply and typing up the above, and I've now
> been able to verify that the checkered output problem goes away if I use
> the VMDOPTIXDEVICE envvar at runtime to restrict VMD to using a single
> GPU in one of these dual A5000 systems. It doesn't matter which GPU I
> restrict it to though; if I render on one GPU, then exit and relaunch VMD
> to switch to rendering with the other GPU, both renders turn out fine. But
> if I set VMDOPTIXDEVICE or VMDOPTIXDEVICEMASK in such a way as to allow
> use of both GPUs, the checkering problem comes back.
> After doing some more digging into how these systems were purchased and
> built by the vendor, it looks like the lab actually bought them with an
> NVLink interconnect in place between the two A5000 GPUs. Although I am
> getting no verification of the NVLink interconnect being available via
> nvidia-smi or similar tools, VMD is reporting a GPU P2P link as being
> available. So, I'm now wondering if the lack of CUDA 11 support in pre-v7
> OptiX was a misdirect, and that this might actually be some sort of issue
> with NVLink instead.
> I can't really find any documentation for VMD and NVLink, so I'm not
> quite sure how one is supposed to tune VMD to work with NVLink'd GPUs, or
> if it's all supposed to be automatic. Who knows, maybe it'll still wind up
> being a pre-v7 OptiX problem specifically with NVLink'd SM 8.6+ GPUs.
> Regardless, for now I've asked someone who is on-site to see if they can
> check one of the workstations for a physical NVLink interconnect, and to
> then remove it if they find it. Once that's done, I'll give VMD another
> try, and see if I still run into this checkering issue without the NVLink
> interconnect being in place.
> --
> Brendan Dennis (he/him/his)
> Systems Administrator
> UCSD Physics Computing Facility
> [1]https://pcf.ucsd.edu/
> Mayer Hall 3410
> (858) 534-9415
> On Tue, Jan 31, 2023 at 12:05 PM Vermaas, Josh <[2]vermaasj_at_msu.edu>
> wrote:
>
> Hi Brendan,
>
> My point is that OptiX 6.5 works just fine with newer versions of CUDA.
> That is what we use in my lab here, and we haven't noticed any
> graphical distortions. As you noted, porting VMD's innards to a newer
> version of OptiX is something beyond the capabilities of a single
> scientist with other things to do for a day job. Do you have a
> minimal working example of something that makes a checkerboard in your
> setup? I'd be happy to render something here just to demonstrate that
> 6.5 works just fine, even with more modern CUDA libraries.
>
> -Josh
>
> From: Brendan Dennis <[3]bdennis_at_physics.ucsd.edu>
> Date: Tuesday, January 31, 2023 at 2:17 PM
> To: "Vermaas, Josh" <[4]vermaasj_at_msu.edu>
> Cc: "[5]vmd-l_at_ks.uiuc.edu" <[6]vmd-l_at_ks.uiuc.edu>
> Subject: Re: vmd-l: Running VMD 1.9.4alpha on newer GPUs that require
> CUDA 11+ and OptiX 7+
>
> Hi Josh,
>
> Thanks for the link. From looking at your repo, it looks like we both
> figured out a lot of the same tweaks needed to get VMD building from
> source on newer systems with newer versions of various dependencies and
> CUDA. Unfortunately though, I don't think tweaking of the configure
> scripts or similar will get VMD building against OptiX 7, as NVIDIA made
> some pretty substantial changes in the OptiX 7.0.0 release that VMD's
> OptiX code doesn't yet reflect. Although it looks like the relevant
> portions of code in the most recent standalone release of Tachyon
> (0.99.5) have been rewritten to support OptiX 7, those changes have not
> been ported over to VMD's internal Tachyon renderer (or at least not as
> of VMD 1.9.4a57), and sadly it's all a bit over my head to port it
> myself.
>
> --
>
> Brendan Dennis (he/him/his)
>
> Systems Administrator
>
> UCSD Physics Computing Facility
>
> [7]https://pcf.ucsd.edu/
>
> Mayer Hall 3410
>
> (858) 534-9415
>
> On Tue, Jan 31, 2023 at 6:58 AM Josh Vermaas <[8]vermaasj_at_msu.edu>
> wrote:
>
> Hi Brendan,
>
> I've been running VMD with CUDA 12.0 and OptiX 6.5, so I think it can
> be done. I've put instructions for how to do this on github.
> [9]https://github.com/jvermaas/vmd-packaging-instructions. This set of
> instructions was designed with my own use case in mind, where I have
> multiple Ubuntu machines all updating from my own repository. This
> saves me time on installing across the multiple machines, while
> respecting the licenses to both OptiX and CUDA. There may be some
> modifications you need to do for your own purposes, as admittedly I
> haven't updated the instructions for more recent alpha versions of
> VMD.
>
> -Josh
>
> On 1/30/23 9:16 PM, Brendan Dennis wrote:
>
> Hi,
>
> I provide research IT support to a lab that makes heavy use of VMD.
> They recently purchased several new Linux workstations with NVIDIA
> RTX A5000 GPUs, which are only compatible with CUDA 11.1 and above.
> If they attempt to use the binary release of VMD 1.9.4a57, which is
> built against CUDA 10 and OptiX 6.5.0, then they run into problems
> with anything using GPU acceleration. Of particular note is
> rendering an image using the internal TachyonL-OptiX option; the
> image is rendered improperly, with a severe checkered pattern
> throughout.
>
> I have been attempting to compile VMD 1.9.4a57 from source for them
> in order to try and get GPU acceleration working. Although I am able
> to compile against CUDA 11.2 successfully, the maximum version of
> OptiX that appears to be supported by VMD is 6.5.0. When built
> against CUDA 11.2 and OptiX 6.5.0, the image checkering still occurs
> on render, but is not nearly as severe as it was with the CUDA 10
> binary release. My guess is that some version of OptiX 7 is also
> needed to fix this for these newer GPUs.
>
> In researching OptiX 7 support, it appears that how one would use
> OptiX in one's code changed pretty substantially with the initial
> 7.0.0 release, but also that CUDA 11 was not supported until the
> 7.2.0 release. It additionally looks like Tachyon 0.99.5 uses OptiX
> 7, and I was able to build the libtachyonoptix.a library with every
> OptiX 7 version <= 7.4.0. However, there does not appear to be a way
> to use this external Tachyon OptiX library with VMD, as all of VMD's
> OptiX support is internal.
>
> Is there any way to use an external Tachyon OptiX library with VMD?
> If not, is there any chance that support for OptiX 7 in VMD is not
> too far off on the horizon, perhaps even in the form of a new alpha
> Linux binary release built against CUDA 11.1+ and OptiX 7.2.0+? For
> now, I've had to tell people that they'll need to make do with
> using the Intel OSPRay or other CPU-based rendering options, but I
> imagine that's going to get frustrating fairly quickly as they watch
> renders take minutes on their brand new systems, while older
> workstations with older GPUs can do them in seconds.
>
> --
>
> Brendan Dennis (he/him/his)
>
> Systems Administrator
>
> UCSD Physics Computing Facility
>
> [10]https://pcf.ucsd.edu/
>
> Mayer Hall 3410
>
> (858) 534-9415
>
> --
>
> Josh Vermaas
>
> [11]vermaasj_at_msu.edu
>
> Assistant Professor, Plant Research Laboratory and Biochemistry and Molecular Biology
>
> Michigan State University
>
> [12]vermaaslab.github.io
>
> References
>
> Visible links
> 1. https://pcf.ucsd.edu/
> 2. mailto:vermaasj_at_msu.edu
> 3. mailto:bdennis_at_physics.ucsd.edu
> 4. mailto:vermaasj_at_msu.edu
> 5. mailto:vmd-l_at_ks.uiuc.edu
> 6. mailto:vmd-l_at_ks.uiuc.edu
> 7. https://pcf.ucsd.edu/
> 8. mailto:vermaasj_at_msu.edu
> 9. https://github.com/jvermaas/vmd-packaging-instructions
> 10. https://pcf.ucsd.edu/
> 11. mailto:vermaasj_at_msu.edu
> 12. http://vermaaslab.github.io

-- 
Research Affiliate, NIH Center for Macromolecular Modeling and Bioinformatics
Beckman Institute for Advanced Science and Technology
University of Illinois, 405 N. Mathews Ave, Urbana, IL 61801
http://www.ks.uiuc.edu/~johns/           
http://www.ks.uiuc.edu/Research/vmd/