CANFAR

Introduction

The Canadian Advanced Network for Astronomy Research (CANFAR) is a consortium that serves data-intensive storage, access, and processing needs of university groups and centres engaged in astronomy research.

To get started on CANFAR you will need to do the following:

  1. Request an account on CADC.
  2. Send an email to CANFAR support with your CADC username, the resources you need, and a few sentences explaining what you are working on.


Virtual Machine

The virtual machine (VM) is a space where we can install software under a given Linux distribution with given CPU, RAM and storage limits. Once we are happy with a given set-up we can freeze these conditions (i.e. all the software versions etc. currently installed) by creating a snapshot that acts like a container for the VM. Jobs can then be submitted through the batch system using a given snapshot.

Note: All processing (except very minor tests) should be done through the batch system and not run directly on the VM.

These are the recommended steps to follow in order to set up a VM on CANFAR:

  1. Create a VM:

    Follow the instructions on CANFAR quick start.

    VMs can be managed on OpenStack.

    Note: An IP address has to be assigned to the VM in order to be able to log in and there are a limited number of IPs per workspace.

  2. SSH to VM:

    Run the following to connect to a given VM:

    ssh ubuntu@IP_ADDRESS

    This will connect you to a generic ubuntu user space, shared between all users. Once connected, software can be installed and tested.

    Note: You should only connect to the VM with the intention of creating a new snapshot or running tests with the current set-up. Avoid making any software changes not intended for a new snapshot.

    Note: The person who creates the VM will have to manually add the SSH keys of any other potential users.

  3. Install the following tools:

    sudo apt update
    sudo apt install git
    sudo apt install make
    sudo apt install autoconf
    sudo apt install libtool

  4. Install miniconda:

    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    bash Miniconda3-latest-Linux-x86_64.sh
    bash

  5. Install VOSPACE client:

    pip install vos

    The VOSPACE client is needed to transfer data to/from the VOSPACE.

  6. Generate certificate to access VOSPACE:

    cadc-get-cert -u USERNAME

    When asked, enter your CADC password. A CADC certificate is also needed to transfer data to/from the VOSPACE.

  7. Create a snapshot of the VM status:

    On OpenStack under "Instances" click the "Create Snapshot" button for the corresponding VM. Be sure to follow the snapshot naming scheme defined for the VM above.

  8. Update VM and create a new snapshot:

    The VM set-up only needs to be done once. Afterwards, the VM can simply be modified for new snapshots, e.g. pull the latest changes to a given software package, install any new dependencies, then repeat step 7.
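
The note in step 2 says that the creator of the VM has to add the SSH keys of other users by hand. The following is a minimal sketch of one way to do this on the VM. It assumes the standard OpenSSH layout, in which every public key listed in the authorized_keys file of the ubuntu account is allowed to log in. The function name add_ssh_key and the key file name in the usage line are hypothetical.

```shell
#!/bin/bash
# Append a public key to an authorized_keys file, skipping keys that
# are already listed (hypothetical helper, not part of CANFAR).
# Usage: add_ssh_key PUBKEY_FILE [AUTHORIZED_KEYS_FILE]
add_ssh_key() {
    local pubkey_file="$1"
    local auth_file="${2:-$HOME/.ssh/authorized_keys}"
    local key

    if [ ! -f "$pubkey_file" ]; then
        echo "No such key file: $pubkey_file" >&2
        return 1
    fi
    key="$(cat "$pubkey_file")"

    # OpenSSH expects restrictive permissions on the directory and file
    mkdir -p -m 700 "$(dirname "$auth_file")"
    touch "$auth_file"
    chmod 600 "$auth_file"

    # -x: match the whole line, -F: fixed string, -q: no output
    if grep -qxF -- "$key" "$auth_file"; then
        echo "Key already present in $auth_file"
    else
        printf '%s\n' "$key" >> "$auth_file"
        echo "Key added to $auth_file"
    fi
}
```

Run as the ubuntu user on the VM, a call such as `add_ssh_key collaborator.pub` would give the owner of that key access. Calling it twice with the same key leaves the file unchanged.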


Batch System

The batch system is a server where jobs can be submitted to the CANFAR cluster using a previously defined snapshot.

Note: You will have to request access to the batch system before you can connect.

  1. SSH to batch system:

    You can connect to the batch system as follows:

    ssh <USERNAME>@batch.canfar.net

    You will be connected to a personal user space.

  2. Source OpenStack environment variables, e.g. for the lensing project:

    source lensing-openrc.sh

    When asked, enter your CADC password. This is a necessary step before submitting jobs.

  3. Create a bash script, for example:

    The bash script defines the command lines to be run on the snapshot. The following template script demonstrates how to:

    • activate a conda environment (e.g. the ShapePipe environment),
    • create an output directory,
    • run a processing script (e.g. ShapePipe),
    • and copy the output back to the VOSPACE.

    Input files, such as a configuration file, can be copied from the VOSPACE to the snapshot with vcp in the same way.

    #!/bin/bash
    export VM_HOME=/home/ubuntu
    source $VM_HOME/miniconda3/bin/activate <MY_CONDA_ENV>
    mkdir output
    <MY_SCRIPT>.py -o output
    vcp --certfile=$VM_HOME/.ssl/cadcproxy.pem output vos:cfis/cosmostat/<USERNAME>

    Note: The default path for a snapshot is not the /home/ubuntu directory, hence the definition of the VM_HOME environment variable.

  4. Create a job file, for example:

    The job file defines the script to be run (i.e. the bash script previously defined), the corresponding outputs and the computational requirements for the job.

    executable     = <MY_SCRIPT>.bash
    
    output         = <MY_SCRIPT>.out
    error          = <MY_SCRIPT>.err
    log            = <MY_SCRIPT>.log
    
    # Make sure the requested resources do not exceed what was
    # specified for the VM
    request_cpus   = 1
    request_memory = 8G
    request_disk   = 10G
    
    queue
    
  5. Submit a job:

    Jobs are submitted using the canfar_submit command followed by the previously defined job file, the name of the desired snapshot and the flavour of the corresponding VM.

    canfar_submit JOB_FILE SNAP_SHOT FLAVOUR

  6. Check queue:

    condor_status -submitter

This command lists the running, idle, and held jobs for you and other users.

To see information for your own jobs only, run:

condor_q [-nobatch]

From there you can get the job ID, which lets you examine your job more closely:

condor_q -better-analyse <ID>

You can SSH into the VM that is (or will be) running your job to inspect it:

condor_ssh_to_job -auto-retry <ID>

For multi-job submissions, the job IDs have sub-numbers, e.g. 1883.0 to 1883.9. You can SSH to each of those VMs with, e.g.:

condor_ssh_to_job -auto-retry 1883.6
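
The job file from step 4 can also be written from the shell, which is convenient when several similar jobs are needed. The following is a minimal sketch that reproduces the job file shown above. The function name write_job_file and its arguments are hypothetical. The optional last argument relies on the standard HTCondor `queue N` statement, which produces sub-numbered jobs such as those mentioned above, and on the `$(Process)` macro to give each of them its own output and error file.

```shell
#!/bin/bash
# Write an HTCondor job file NAME.job that runs NAME.bash
# (hypothetical helper, not part of CANFAR).
# Usage: write_job_file NAME [CPUS] [MEMORY] [DISK] [N_JOBS]
write_job_file() {
    local name="$1"
    local cpus="${2:-1}"
    local memory="${3:-8G}"
    local disk="${4:-10G}"
    local n_jobs="${5:-1}"

    # The here-document delimiter is unquoted, so shell variables are
    # expanded, while \$(Process) is written out literally and left
    # for HTCondor to replace with the sub-number of each job.
    cat > "${name}.job" <<EOF
executable     = ${name}.bash

output         = ${name}_\$(Process).out
error          = ${name}_\$(Process).err
log            = ${name}.log

# Make sure the requested resources do not exceed what was
# specified for the VM
request_cpus   = ${cpus}
request_memory = ${memory}
request_disk   = ${disk}

queue ${n_jobs}
EOF
}
```

For example, `write_job_file my_script 1 8G 10G 10` writes my_script.job with ten queued jobs, which can then be passed to canfar_submit as described in step 5.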


Troubleshooting

If the above condor commands do not help, try:

cloud_status -m

to check the status of all VMs.

Sometimes a snapshot image is not (yet) active and shared, since its creation can take a long time. Check the status with:

openstack image show -c visibility -c status <SnapShotName>

When status = active, the job can be started. The visibility field has the value private before first use and changes to shared afterwards.

In general, a job should start within 5 to 10 minutes, or later if the queue is full. If the job is launched before the snapshot status is active, it might be stuck in the queue for a long time, possibly indefinitely.
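
To avoid submitting a job too early, the status check above can be repeated until the snapshot is active. The following is a minimal sketch, assuming the OpenStack environment variables have already been sourced and that the openstack client supports the `-f value` output format, which prints only the value of the selected field. The function name wait_for_snapshot and its default limits are hypothetical.

```shell
#!/bin/bash
# Poll the status of a snapshot image until it is "active"
# (hypothetical helper, not part of CANFAR).
# Usage: wait_for_snapshot SNAPSHOT_NAME [MAX_TRIES] [SLEEP_SECONDS]
wait_for_snapshot() {
    local name="$1"
    local max_tries="${2:-30}"
    local pause="${3:-60}"
    local status
    local try=1

    while [ "$try" -le "$max_tries" ]; do
        # Same query as above, restricted to the bare status value
        status="$(openstack image show -c status -f value "$name")"
        if [ "$status" = "active" ]; then
            echo "Snapshot $name is active"
            return 0
        fi
        echo "Try $try/$max_tries: status is '${status:-unknown}'"
        sleep "$pause"
        try=$((try + 1))
    done

    echo "Snapshot $name not active after $max_tries tries" >&2
    return 1
}
```

A call such as `wait_for_snapshot <SnapShotName> && canfar_submit JOB_FILE <SnapShotName> FLAVOUR` would then submit the job only once the image is ready.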

If the problem persists, contact Seb on the CANFAR Slack channel. He usually replies quickly, and some issues can only be fixed by him.