Steps for generating VLASS catalogs (using Docker image)

To create VLASS catalogs, the following pipelines need to be executed sequentially.

  1. Pipeline 1: This pipeline downloads Very Large Array Sky Survey (VLASS) images from the National Radio Astronomy Observatory (NRAO) website. It then carries out source extraction on these images using PyBDSF to create a basic component catalog and a subtile catalog, typically containing about 3 million components and 35,000 entries, respectively.

  2. Pipeline 2: This pipeline performs additional QA on both the component catalog and the subtile catalog. It also creates a host ID table, which is currently being replaced by a source table generated by Pipeline 4.

  3. Pipeline 3: This pipeline employs a self-organizing map (SOM) to cluster radio components from the VLASS Component Catalog based on morphology. It enhances the basic component catalog by adding four new columns.

  4. Pipeline 4: This pipeline runs DRAGNhunter to identify likely doubles in VLASS and potential host candidates from AllWISE. The likelihood ratio code is then used to identify the most likely correct host for these doubles, as well as hosts for single-component sources. Where a double source has a radio core identified, in a few percent of cases the core and host identifications disagree; in such cases the core position is used to select the host over the likelihood ratio identification.

The hardware requirements and other details for each pipeline are documented in the table below.

Final deliverables

  1. Component catalog
  2. Host ID catalog
  3. Subtile catalog
  4. Double radio source catalog

For more detailed documentation, visit https://cirada.ca/vlasscatalogueql0 and the Catalog User Guide.

Note:

All these pipelines and their dependencies are already present within the Docker image. The VLASS survey is conducted in three epochs and imaged using two different procedures (Quick Look, or QL, vs Single Epoch, or SE). Steps requiring user input regarding the epoch (1, 2, or 3) and the type of data (QL or SE) are highlighted in yellow.

Pipeline 1
  • Outputs: two catalogs (.csv), 1) intermediate component catalog and 2) intermediate subtile catalog, in "$PIPE1/data/products"
  • Timeline: >~2 weeks
  • Files used: manifest.csv (go to Step 2)
  • Intermediate products to delete (use caution: only after vetting): $PIPE1/data/tiles; for SE data, DO NOT delete until after pipeline 5
  • Hardware requirements: 2 TB disk space

Pipeline 2
  • Outputs: four catalogs (.csv) in catalogue_output_files; the subtile catalog is final
  • Timeline: 1-2 weeks
  • Files used: outputs of pipeline 1 (taken automatically; no user input required)
  • Intermediate products to delete: $PIPE2/host_table/LR_output
  • Hardware requirements: neither computationally nor disk-space intensive

Pipeline 3
  • Outputs: component catalog (.csv) in $PIPE3_2 with four new columns and an updated Quality_flag
  • Timeline: >~3 weeks
  • Files used: outputs of pipeline 2 (follow the steps listed)
  • Intermediate products to delete: $PIPE3_1/data_out/VLASS/
  • Hardware requirements: first half requires ~1 TB disk space (approximate); second half requires access to a GPU

Pipeline 4
  • Outputs: dragns.fits and sources.fits in $PIPE4
  • Timeline: ~1 week
  • Files used: outputs of pipeline 3_2
  • Intermediate products to delete: output_files/supplementarydata
  • Hardware requirements: neither computationally nor disk-space intensive

SPIX (only for SE)
  • Outputs: final component catalog (.csv) in $PIPE1
  • Timeline: ~1 week
  • Files used: outputs of pipeline 3_2
  • Hardware requirements: ~4 TB disk space

The procedures for accessing, installing the necessary packages, and running each pipeline are outlined below.

Starting Docker

  1. Step 1: Start the Docker container with an interactive terminal.
    sudo docker run -it cont_pipe
    
    If you are on CANFAR, simply start a shell instead:
    bash
    
  2. Step 2. Setting up the workdir. By default the workdir is set to /tmp in the Docker image. However, if you wish to change the workdir, edit the first line of the file set_workdir.sh to point to your desired location. Run this file to copy all the files there.
vim set_workdir.sh
. ./set_workdir.sh
cd $WORKDIR

Pipelines 1 and 2 (Source detection)

Step 1: Setting up virtual environment

conda activate myenv

Step 2: Getting ready to run the pipelines.

The first step is to get a file with the URLs of all VLASS fits images on which the source finding is to be done. One can use get_urls_from_nrao.py for this. The usage is as follows.

cd $PIPE1 
python get_urls_from_nrao.py <URL> <mode> I

Here, an example URL is https://archive-new.nrao.edu/vlass/se_continuum_imaging/VLASS2.1 . To find the relevant URLs, visit https://archive-new.nrao.edu/vlass/quicklook/. There are two modes: 'w' (write) and 'a' (append). The two options for imagetype are 'alpha' and 'I'; choose 'I'. The code generates an output file named "manifest.csv". Where there are multiple folders, such as "VLASS2.1" and "VLASS2.2", it may be necessary to run the code a second time in append mode, which adds new entries to the existing "manifest.csv" file instead of overwriting it. Now copy "manifest.csv" to the subfolder media/manifests/.

cp manifest.csv media/manifests
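The difference between write and append mode can be pictured with a minimal, self-contained sketch (the single-column schema and filenames here are illustrative, not the actual manifest format):

```python
import csv
import os
import tempfile

def write_manifest(path, urls, mode="w"):
    """Write ('w') or append ('a') image URLs to a manifest CSV.

    Mimics how a second run of get_urls_from_nrao.py in append mode
    adds entries instead of overwriting (schema is illustrative).
    """
    new_file = mode == "w" or not os.path.exists(path)
    with open(path, mode, newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["url"])  # header only when starting fresh
        writer.writerows([[u] for u in urls])

path = os.path.join(tempfile.mkdtemp(), "manifest.csv")
write_manifest(path, ["https://archive-new.nrao.edu/.../VLASS2.1/a.fits"], mode="w")
write_manifest(path, ["https://archive-new.nrao.edu/.../VLASS2.2/b.fits"], mode="a")
with open(path) as f:
    print(sum(1 for _ in f))  # prints 3 (header + 2 URLs)
```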

Step 3: Run the pipeline. Pipeline 1 is written in a modular form, and each of the steps is run in series and in the background. Run the following commands in the specified order:

rm -rf data
python3 catenator.py flush # pipeline cleanup utility
python3 catenator.py configure # presents the available configurations; choose pipeline_v1.yml
. ./pipe1and2.sh

The logs generated by pipeline 1 can be found in the data/logs directory. To ensure that each step has completed successfully, it is recommended to run "python3 catenator.py monitor" after each step.

Similarly, review the output log files of pipeline 2 to ensure that the pipeline has not terminated unexpectedly at any step. It is recommended to examine the output file of each step; for example, run tail -300 step1.out after completing the first step.
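Checking log tails can be automated with a small helper along these lines (the failure markers are generic assumptions, not the pipeline's actual log vocabulary):

```python
from collections import deque

def step_looks_ok(log_path, markers=("Traceback", "ERROR", "Killed"), n=300):
    """Return False if any failure marker appears in the last n lines,
    roughly equivalent to eyeballing `tail -300 step1.out`."""
    with open(log_path) as f:
        tail = deque(f, maxlen=n)  # keep only the last n lines
    return not any(m in line for line in tail for m in markers)
```

Run it on each step*.out file after the step finishes; anything returning False deserves a manual look at the full log.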

Pipeline outputs

The pipeline 1 output catalogs (with .csv extension) will be present in the folder "data/products". The final outputs of pipeline 2 are in the folder "catalogue_output_files". Expected timeline: >~3-4 weeks.

Outputs vetting

The code for vetting the outputs is also present in $PIPE1, namely Vetting_pipeline1_outputs.ipynb. If you are on CANFAR, navigate to /tmp/continuum_bdp_catalogue_generator in the folder structure on the left and double-click the notebook to launch it. Run it and make sure the stats and plots are OK.

Pipeline 3 (SOM pipeline)

Step 1: Setting up virtual environment

conda deactivate
conda activate cutoutenv

Step 2: Copy the duplicate-free component catalog from the pipeline 2 output folder to the current working directory.

cd $PIPE3_1
cp $PIPE2/catalogue_output_files/*_duplicate_free.csv .

Step 3: Select the relevant epoch by editing cutout_provider_core_old/core/vlass.py in any text editor: uncomment the relevant line (lines 104-109), e.g., if ("3.1.ql" in url and "tt0" in url) or ('3.2.ql' in url and 'tt0' in url):

vim core/vlass.py
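The condition being uncommented is just a substring filter on the cutout URLs; mirroring the epoch-3 example line above as a standalone sketch:

```python
def is_epoch3_ql_tt0(url: str) -> bool:
    """True for epoch-3 Quick Look tt0 image URLs, mirroring the
    example condition in core/vlass.py (epochs 3.1 and 3.2)."""
    return ("3.1.ql" in url and "tt0" in url) or ("3.2.ql" in url and "tt0" in url)

print(is_epoch3_ql_tt0("VLASS3.1.ql.T10t01.image.tt0.subim.fits"))  # True
print(is_epoch3_ql_tt0("VLASS2.1.ql.T10t01.image.tt0.subim.fits"))  # False
```

For a different epoch, the uncommented line swaps in that epoch's version substrings in the same way.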

Step 4: Download the cutouts at source positions using cutout_provider_core. Edit the component catalog filename (line 4) in the file 'splitandcreate.sh' to match the output from the previous step, e.g., ORIGINAL_FILENAME="VLASS3QLCIR_components_duplicate_free.csv"

vim splitandcreate.sh
. ./splitandcreate.sh
python3 fix_headers.py 3

Step 5: Switch to the sidelobe pipeline environment and open sidelobe_pipeline.py for any required edits.

conda deactivate
conda activate sidelobe_pipe_env
cd $PIPE3_2
vim sidelobe_pipeline.py

Step 6: Run the pipeline.

. ./pipe3_2.sh

Description of the format

python3 sidelobe_pipeline.py <catalogue> <outfile> -p <path_to_image_cutouts> -s <som_file> -n <neuron_table> [--cpu] [--overwrite]

where the parameters mean the following

  • catalogue: The name of the input catalogue (csv or fits).
  • outfile: The desired name for the output catalogue.
  • path_to_image_cutouts: The directory that contains all of the VLASS cutouts.
  • som_file: The name of the SOM binary file (SOM_B3_h10_w10_vlass.bin).
  • neuron_table: The file containing the neuron information (neuron_info.csv).
  • --cpu: A Boolean flag. If set, it runs the Mapping stage in CPU mode (Warning: Slow).
  • --overwrite: A Boolean flag. If set, it overwrites the Image and Mapping binaries.
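For illustration, a typical invocation can be assembled from these parameters as follows (the helper function itself is hypothetical; the filenames come from the parameter list above):

```python
def build_sidelobe_cmd(catalogue, outfile, cutout_dir, som_file, neuron_table,
                       cpu=False, overwrite=False):
    """Assemble the sidelobe_pipeline.py argument list described above."""
    cmd = ["python3", "sidelobe_pipeline.py", catalogue, outfile,
           "-p", cutout_dir, "-s", som_file, "-n", neuron_table]
    if cpu:
        cmd.append("--cpu")        # CPU-mode mapping (slow)
    if overwrite:
        cmd.append("--overwrite")  # regenerate Image/Mapping binaries
    return cmd

print(" ".join(build_sidelobe_cmd(
    "VLASS_components.csv", "VLASS_components_som.csv", "cutouts/",
    "SOM_B3_h10_w10_vlass.bin", "neuron_info.csv")))
```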

Pipeline outputs

  • Image binary (preprocessed images)
  • Catalogue of components that failed preprocessing (fits file)
  • Catalogue of components that passed preprocessing (fits file)
  • Mapping binary
  • Transform binary
  • Final output catalogue

Pipeline 4 (DRAGN hunter)

Step 1: Setting up virtual environment

conda activate pip4env

Step 2: Run the pipelines.

cd $PIPE4
python hunt_dragns_and_find_host.py <component_filename_after_som.csv>

where <component_filename_after_som.csv> is the filename of the radio catalogue data file. The code runs perfectly fine with the component catalog as well.

By default this code will attempt to find redshifts for the hosts it identifies. In some instances this step can fail, e.g. as a result of the CDS server being down or a broken connection. If this happens, obtaining the redshifts can be reattempted without repeating the slow host-finding step by calling:

python fetch_z.py

There are a number of optional arguments to fetch_z.py which can be invoked (see its help), but the default setup should be okay for most cases where this needs to be run. If you are still struggling with timeout issues, use the --chunksize argument to query smaller data chunks at any one time.
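The effect of a smaller --chunksize is simply to split one large query into several smaller ones, roughly like this generic sketch (not the actual fetch_z.py internals):

```python
def chunked(items, chunksize):
    """Yield successive chunks of at most `chunksize` items, as a
    smaller --chunksize would split one big CDS query into several."""
    for i in range(0, len(items), chunksize):
        yield items[i:i + chunksize]

hosts = [f"host_{i}" for i in range(10)]
print([len(c) for c in chunked(hosts, 4)])  # [4, 4, 2]
```

Each chunk is then queried independently, so a transient timeout costs one small chunk rather than the whole request.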

Pipeline outputs

The output of the code consists of a table of double sources (dragns.fits) with properties and host IDs, and a table of all sources (single components and doubles) with host identifications and redshifts where available. These are written to the output_files directory. A number of supplementary files provide more extensive metadata and are written to the output_files/supplementarydata folder.

Additional pipeline for extracting spectral index images

To be run on single-epoch images

Spectral index values can now be extracted with the release of single-epoch data. To incorporate four additional columns (the spectral index, its error, Alpha_quality_flag, and the updated error) into the single-epoch catalog, follow the steps outlined below. It is recommended to run this pipeline on the component catalog after completing the preceding three pipelines.
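For context, the spectral index α is conventionally defined by S ∝ ν^α, so between two frequencies it follows the textbook relation below (this illustrates the quantity being catalogued, not the pipeline's exact implementation):

```python
import math

def spectral_index(s1, s2, nu1, nu2):
    """alpha = log(S1/S2) / log(nu1/nu2), from S proportional to nu**alpha."""
    return math.log(s1 / s2) / math.log(nu1 / nu2)

# A source twice as bright at 2 GHz as at 4 GHz is a steep-spectrum
# source with alpha of approximately -1:
print(spectral_index(2.0, 1.0, 2e9, 4e9))
```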

Step 1: Setting up virtual environment

conda activate myenv

Step 2: Getting ready to run the pipelines. The first step is to download the spectral index and error images. First, cd into the pipeline 1 directory.

cd $PIPE1

Next we need to get a file listing all the URLs of all VLASS spectral index images and errors. One can use get_urls_from_nrao.py for this. The usage is as follows.

python get_urls_from_nrao.py <URL> <mode> alpha

Back up the existing manifest.csv and copy the newly generated one to media/manifests directory.

mv media/manifests/manifest.csv media/manifests/manifest_I.csv
mv manifest.csv media/manifests/

Step 3: Download spectral index images using pipeline 1.

python3 catenator.py flush # pipeline cleanup utility
python3 catenator.py configure # This gives you 2 options (test and v1); choose v1
python3 catenator.py download # This can take time depending on number of files that need to be downloaded

Step 4: Wait until download is finished and move the alpha images.

mkdir $PIPE1/data/alpha/
mv $PIPE1/data/tiles/*/*alpha* $PIPE1/data/alpha/
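The same move can be expressed in Python if the shell glob is inconvenient (a sketch mirroring the mv command above; paths are illustrative):

```python
import pathlib
import shutil

def move_alpha_images(tiles_dir, alpha_dir):
    """Move every file whose name contains 'alpha' out of the tile
    subdirectories, mirroring `mv data/tiles/*/*alpha* data/alpha/`.
    Returns the number of files moved."""
    alpha_dir = pathlib.Path(alpha_dir)
    alpha_dir.mkdir(parents=True, exist_ok=True)
    moved = 0
    for f in pathlib.Path(tiles_dir).glob("*/*alpha*"):
        shutil.move(str(f), str(alpha_dir / f.name))
        moved += 1
    return moved
```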

Step 5: Run add_spix.py

python3 add_spix.py

Automated version

All of the above steps have been automated and can be run non-interactively.

Step 1. Setting up the workdir. By default the workdir is set to /tmp in the Docker image. However, if you wish to change the workdir, edit the first line of the file set_workdir.sh to point to your desired location. Run this file to copy all the files there.

vim set_workdir.sh
. ./set_workdir.sh
cd $WORKDIR

Step 2. Now run automate.py to generate the bash scripts that will run everything non-interactively. The Python code will prompt you to input the type of the catalog and the epochs.

python automate.py

Step 3. The above code will generate pipe1andpipe2.sh and pipe3andpipe4.sh. Two separate scripts are generated because it is a good idea to check the outputs after pipelines 1 and 2 before proceeding to pipelines 3 and 4.

These can then be run with the following commands.

. ./pipe1andpipe2.sh
cd $WORKDIR
. ./pipe3andpipe4.sh

TBD

For enabling GPUs for pipeline 4, see below.

pip install skaha

Then, in Python:

from skaha.session import Session

session = Session()

session_id = session.create(
    name="testsom",
    image="images.canfar.net/cirada/cont_pipe:v1",
    cores=4,
    ram=16,
    kind="headless",
    cmd="bash",
    args="pipe3andpipe4.sh",
    replicas=1,
    gpu=1,  # try to see whether GPU is faster
)