PyRx

This is a quick introduction to the usage of the screening software PyRx that is installed on the HPCVL clusters. It is meant as an initial pointer to more detailed information. It also explains a few specific details about local usage.

What is PyRx ?

PyRx is a Virtual Screening software for Computational Drug Discovery that can be used to screen libraries of compounds against potential drug targets. It is a GUI that uses a large body of established open source software such as:

  • AutoDock 4 and AutoDock Vina as the docking software.
  • AutoDockTools, used to generate input files.
  • Python as a programming/scripting language.
  • wxPython for cross-platform GUI.
  • The Visualization ToolKit (VTK) by Kitware, Inc.
  • Enthought Tool Suite, including Traits, for application building blocks.
  • Opal Toolkit for running AutoDock remotely using web services.
  • Open Babel for importing SDF files, removing salts and energy minimization.
  • matplotlib for 2D plotting.

Version, Location and Access

The binary executable is in /opt/PyRx on the SW (Linux) Cluster. The present version of the program is 0.9.4 (somewhat modified), available on the Linux platform as a 64-bit build; the relevant executables are in /opt/PyRx/jeff/0.9.4. Documentation can be found at the main PyRx site.

Running PyRx

Setup

You can run PyRx only on the swlogin1 Linux login node (it won't run on Solaris). From there, the setup for PyRx is very simple. It is only necessary to type:

use PyRx

This will add the proper directory to your PATH, and off you go.
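
A quick way to verify the setup is to ask the shell where it now finds the program; this is just a sanity check, and the exact path reported depends on the installed version:

which PyRx
# should report a location under /opt/PyRx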

Interactive runs

Issuing the command

PyRx

will pop up the GUI. All operations are performed from within that interface. At a minimum, you will have to specify a macromolecule and at least one compound that you want to "dock". These molecules can be specified in several formats, such as pdb, pdbq, cif, or mol2. You can import or load molecules from the

File -> Load Molecule

or the

File -> Import ...

menu entries.

The actual analysis is performed using various tabs on the GUI. As an example, we outline the steps using the "Vina Wizard", which runs software called "AutoDock Vina" for the analysis:

Vina Wizard -> Start Here -> (select /opt/PyRx/0.9.2/bin/vina) -> Start
(highlight Ligands and Macromolecule(s)) -> Forward
(adjust values for Search Space) -> Forward
(check results in bottom window)

There is of course a lot more to it, but the authors of the software claim that it is intuitive enough to figure out as you go. Your mileage may vary.

Batch runs

If you are screening hundreds (or even thousands) of molecules using PyRx, the time required may be too much for interactive usage. PyRx offers a basic interface to a scheduler, but the default settings are too generic to work with our systems.

For the Vina Wizard, we have provided a work-around that allows you to work through a large number of runs using the machines of the SW cluster in parallel. Before trying this, read through our Grid Engine help file to learn how jobs are submitted to our production clusters.

The procedure for this starts off the same as for the interactive approach:

Vina Wizard -> Start Here -> (select Cluster(Portable Batch System)) -> Start
(highlight Ligands and Macromolecule(s)) -> Forward
(adjust values for Search Space) -> Forward

However, in this case the "Cluster" setting was selected, and as a result the program does not actually run any docking software; instead, it generates a large number of scripts in a directory

~/.PyRx_workspace/Macromolecules/MACRO

where "MACRO" stands for the name of the macromolecule you are using, and "~" is short for the name of your home directory. To run the actual analysis on our cluster, you now need to go into that directory and execute a "perl" script that we have provided for this purpose:

cd ~/.PyRx_workspace/Macromolecules/MACRO
PyRxVinaArray.pl

This will generate two new sub-directories, "jobs" and "logs", copy in the scripts mentioned earlier, then produce a job for our scheduler "Grid Engine" and submit it. Using the qstat command, you should then see something like:

$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
 952371 0.50734 runVina.sh hpcXXXX      r     11/17/2015 12:59:29 abaqus.q@sw0013.hpcvl.org          8 64
 952371 0.50734 runVina.sh hpcXXXX      r     11/17/2015 12:48:59 abaqus.q@sw0020.hpcvl.org          8 60
 952371 0.50734 runVina.sh hpcXXXX      r     11/17/2015 12:58:29 abaqus.q@sw0044.hpcvl.org          8 62
 952371 0.50734 runVina.sh hpcXXXX      r     11/17/2015 12:58:59 abaqus.q@sw0047.hpcvl.org          8 63
 952371 0.50734 runVina.sh hpcXXXX      r     11/17/2015 13:05:59 abaqus.q@sw0048.hpcvl.org          8 65
 952371 0.50734 runVina.sh hpcXXXX      r     11/17/2015 13:09:29 abaqus.q@sw0054.hpcvl.org          8 66
 952371 0.50734 runVina.sh hpcXXXX      qw    11/17/2015 09:30:03                                    8 67-511:1

As you can see, it is working on 6 "Vina" tasks simultaneously, with 8 processors each, for a total of 48.
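
What PyRxVinaArray.pl submits is a Grid Engine array job: one job with many numbered tasks, each docking a batch of ligands. A minimal sketch of such a script is shown below for orientation only; the task range, the jobs/ file naming, and the log file names are assumptions for illustration, not the literal contents of the generated script:

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -pe shm.pe 8                 # 8 processors per task, as in the qstat output above
#$ -t 1-511                     # task range; the real range depends on the number of ligands
#$ -o logs/vina.$TASK_ID.out    # per-task output log (hypothetical naming)
#$ -e logs/vina.$TASK_ID.err    # per-task error log (hypothetical naming)
# Each task executes one of the docking scripts that PyRx generated:
bash jobs/runVina_$SGE_TASK_ID.sh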

Once the "qstat" command does not show anything anymore, the analyses are finished, and you can go back to your PyRX GUI:

-> Forward
(check results in bottom window)
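
If you prefer not to poll "qstat" by hand, a simple shell loop can wait for the array job to drain; this is a sketch that assumes the job name "runVina" shown in the qstat output above:

while qstat | grep -q runVina; do sleep 60; done
echo "All Vina tasks finished."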

Note that this works only for the analysis with Vina. If you want to do something similar with a different analysis (for instance Autodock4), please get in touch with us. We can probably come up with a solution.

Production runs

To submit a production job on our clusters, you must use the Grid Engine scheduler. For details, read our Grid Engine help file. Production jobs that are run without the scheduler will be terminated by the system administrator.

For a Fluent production job, this means that rather than issuing the above batch command directly, you wrap it in a Grid Engine script that looks somewhat like this:

#!/bin/bash
#$ -S /bin/bash                  # run the job under bash
#$ -q abaqus.q                   # submit to the abaqus.q queue
#$ -l qname=abaqus.q             # force execution on the Linux cluster (important!)
#$ -V                            # export the current environment to the job
#$ -cwd                          # run in the current working directory
#$ -pe shm.pe 12                 # request 12 slots in the shared-memory parallel environment
#$ -m be                         # send mail at the beginning and end of the job
#$ -M hpcXXXX@localhost          # mail address; replace hpcXXXX by your username
#$ -o STD.out                    # file for standard output
#$ -e STD.err                    # file for standard error
rm fan_1.dat
. /opt/fluent/ansys-16.1/setup_64bit.sh   # set up the Fluent environment for Grid Engine
fluent 3ddp -t$NSLOTS -g -i example.flin  # 3D double precision, parallel, no GUI

Here we are running the above example batch file "example.flin" using 12 processors on a parallel machine. The output and any error messages from the system are re-directed to files called "STD.out" and "STD.err", respectively. The "#$ -q" and "#$ -l" entries force execution on the Linux cluster (important!). Email notification is handled by the "#$ -m" and "#$ -M" lines. Replace "hpcXXXX" by your actual username, and make sure that a file called ".forward" containing your actual email address is in your home directory. This practice makes it impossible for other users to see your email address.
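
Creating the ".forward" file is a one-liner; the address below is a placeholder you would replace with your own:

echo "jane.doe@example.com" > ~/.forward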

Many Fluent jobs that you run on our machines are likely to be quite large. To utilize the parallel structure of our machines, Fluent offers several options to execute the solver in a parallel environment, i.e. on several CPUs simultaneously. The default option for such runs is MPI, i.e. the Message Passing Interface is used for inter-process communication.

To take advantage of the parallel capabilities of Fluent, you have to call the program with additional command line options that specify the details of your parallel run:

  • -tn, where n is the number of processors requested; e.g., to run with 12 processors you would use the option -t12 (see the example after this list).
  • -g specifies that the GUI should be suppressed. This is required for batch jobs.
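
For instance, a batch run on 8 processors would be launched with a command line like the following (using the same hypothetical input file "example.flin" as in the script above):

fluent 3ddp -t8 -g -i example.flin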

Parallel jobs of longer runtime should only be run in batch mode through Grid Engine. The number of processors, "12" in our example script, appears only once, after

#$ -pe shm.pe

which is where you let the Grid Engine know how many processors to allocate to run the program. The internal environment variable NSLOTS will automatically be set to this value and can then be used in the fluent command line.

It is also necessary to source a setup file called setup_64bit.sh. This will set various environment variables and enable the Fluent program to properly interact with Grid Engine. If you are interested, take a look. The file is readable.

All processes are allocated within a single node. This makes communication more efficient and avoids problems with control by Grid Engine. The effect is that, while still using MPI, Fluent employs a so-called shared-memory layer for communication. The disadvantage is that the size of the job is restricted by the number of cores on a node. Once the script has been adapted (let's call it "fluent.sh"), it can be submitted to Grid Engine by

qsub fluent.sh

from the login node. Note that the job will appear as a parallel job in the output of the Grid Engine's "qstat" or "qmon" commands. Note also that submitting a parallel job in this way only pays off for large systems that use many CPU cycles, since the overhead for assigning processes, preparing nodes, and communicating between them is considerable.

There is an easier way to do this: we supply a small Perl script, "AnsysSubmit", that can be called directly and will ask a few basic questions, such as the name for the job to be submitted and the number of processes to be used in the job. Simply type

AnsysSubmit

and answer the questions. The script expects a Fluent input file with "file extension" .flin to be present and will do everything else automatically. This is meant for simple Fluent job submissions. More complex job submissions are better done manually.

Further Help

Fluent is a complex software package and requires some practice to be used efficiently. In this FAQ we cannot explain its use in any detail.

The documentation for Fluent can be accessed from inside the program GUI by clicking the "Help" button at the upper right. It is in HTML format. The PDF version of the docs can be found in

/opt/fluent/ansys-16.0/v140/commonfiles/help/en-us/pdf

Fluent documentation is subject to the same license terms as the software itself, i.e. you have to be signed up as a Fluent user in order to access it.

If you are experiencing trouble running a batch command script, check carefully whether the sequence of commands is exactly in sync with the program; this might mean typing them in interactively as a test. If you have problems with Grid Engine, read our FAQ on that subject, and perhaps consult the manual for that software, which is accessible as a PDF file. HPCVL also provides user support in case of technical problems: just send email to cac.help@queensu.ca.