CLASHChimeras¶
CLASHChimeras is a Python package for analysing CLASH datasets. It takes raw fastq files as input and provides comprehensive analysis of RNA profiles and chimeric reads identification. The output is CSV and BED format files for easy visualization in Genome Browsers.
Installation¶
You can install it using pip once you have set up Python version 3.4 or above. Please use this guide for setting up Python if you have not done so already. After setting up Python and pip, you can run the following in your shell
For local installation (usually $HOME/.local):
$ pip3 install --user CLASHChimeras
For global installation (usually /usr/):
Note
You should have sudo privileges
$ sudo pip3 install CLASHChimeras
Dependencies¶
Warning
These dependencies must be satisfied if you want to use align-for-chimeras
CLASHChimeras requires certain software to be installed and set up before you can use it fully. You need to install the following software explicitly:
Usage¶
The package is used through three executable scripts:
download-for-chimeras¶
Downloads the required sequences and creates the bowtie2 indexes required for alignment
usage: An example usage is: download-for-chimeras -gor "H.sapiens" -mor hsa
- Options:
--gencodeOrganism=H.sapiens, -gor=H.sapiens Select model organism
Possible choices: H.sapiens, M.musculus
--mirbaseOrganism=hsa, -mor=hsa Select organism to download microRNAs for
--path=~/db/CLASHChimeras, -pa=~/db/CLASHChimeras Location where all the database files and indexes will be downloaded
--logLevel=INFO, -l=INFO Set logging level
Possible choices: INFO, DEBUG, WARNING, ERROR
--bowtieExecutable, -be Provide bowtie2 executable if it’s not present in your path
--tophatExecutable, -te Provide Tophat executable if it’s not present in your path
--miRNA=hairpin, -mi=hairpin Which miRNA sequences to align
Possible choices: mature, hairpin
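The --bowtieExecutable and --tophatExecutable options are only needed when the aligners are not on your PATH. A minimal sketch of how you might check this beforehand is shown here; it assumes the executables are named `bowtie2` and `tophat` on your system, and the helper `find_missing` is a hypothetical name, not part of CLASHChimeras.

```python
import shutil


def find_missing(executables):
    """Return the names from `executables` that cannot be found on PATH."""
    return [name for name in executables if shutil.which(name) is None]


if __name__ == "__main__":
    # Assumed executable names; adjust them if your installation differs.
    missing = find_missing(["bowtie2", "tophat"])
    if missing:
        print("Not on PATH:", ", ".join(missing))
    else:
        print("bowtie2 and tophat were both found on PATH")
```

If a tool is reported as missing, you can either add it to PATH or pass its location with -be or -te.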
align-for-chimeras¶
Warning
The input fastq is expected to be adapter trimmed and quality controlled
Note
Flexbar can be used to trim raw fastq sequences
Given a fastq file, this script executes bowtie2 and tophat aligners to generate alignment files necessary for detecting chimeras in the reads
usage:
align-for-chimeras -i input.fastq -si /path/to/smallRNA_index -ti /path/to/targetRNA_index -o output -r bowtie2
align-for-chimeras -i input.fastq -gi /path/to/genome_index -tri /path/to/transcriptome_index -o output -r tophat
To see detailed help, please run align-for-chimeras -h
- Options:
--input, -i Input file containing reads fastq
--smallRNAIndex, -si Provide the smallRNA bowtie2 index (Usually resides in ~/db/CLASHChimeras or elsewhere if you have specified in --path -pa during initialize)
--targetRNAIndex, -ti Provide the targetRNA bowtie2 index (Usually resides in ~/db/CLASHChimeras or elsewhere if you have specified in --path -pa during initialize)
--genomeIndex, -gi Provide the genome bowtie2 index (Usually resides in ~/db/CLASHChimeras or elsewhere if you have specified in --path during initialize)
--transcriptomeIndex, -tri Provide the transcriptome index as specified in tophat --transcriptome-index
--output, -o The output name without extension (.sam .bam will be added)
--run=bowtie2, -r=bowtie2 Run the following aligner for raw reads
Possible choices: bowtie2, tophat
--logLevel=INFO, -l=INFO Set logging level
Possible choices: INFO, DEBUG, WARNING, ERROR
--gzip=False, -gz=False Whether your input file is gzipped
--bowtieExecutable, -be Provide bowtie2 executable if it’s not present in your path
--tophatExecutable, -te Provide Tophat executable if it’s not present in your path
--preset=sensitive-local, -p=sensitive-local Provide preset for bowtie2
Possible choices: very-fast, fast, sensitive, very-sensitive, very-fast-local, fast-local, sensitive-local, very-sensitive-local
--tophatPreset=very-sensitive, -tp=very-sensitive Provide preset for Tophat
Possible choices: very-fast, fast, sensitive, very-sensitive
--mismatch=1, -m=1 Number of seed mismatches as represented in bowtie2 as -N
Possible choices: 0, 1
--reverseComplement=False, -rc=False Align to reverse complement of reference as represented in bowtie2 as --norc
--unaligned=False, -un=False Whether to keep unaligned reads in the output sam file. Represented in bowtie2 as --no-unal
--threads=1, -n=1 Specify the number of threads
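When the same alignment has to be run for many samples, it can help to assemble the command line programmatically. The sketch below builds an argument list for a bowtie2-mode run using only the options listed above; `build_align_command` is a hypothetical helper, and passing `-gz` as a bare flag is an assumption about how the flag is parsed.

```python
import os
import shlex


def build_align_command(fastq, small_index, target_index, output,
                        gzipped=False, threads=1):
    """Build an argument list for a bowtie2-mode run of align-for-chimeras."""
    cmd = [
        "align-for-chimeras",
        "-i", fastq,
        "-r", "bowtie2",
        # Expand "~" here because no shell will do it for an argument list.
        "-si", os.path.expanduser(small_index),
        "-ti", os.path.expanduser(target_index),
        "-o", output,
        "-n", str(threads),
    ]
    if gzipped:
        cmd.append("-gz")  # assumption: accepted as a bare flag
    return cmd


if __name__ == "__main__":
    cmd = build_align_command("input.fastq.gz", "/path/to/smallRNA_index",
                              "/path/to/targetRNA_index", "output",
                              gzipped=True, threads=4)
    # Print a shell-safe version of the command for inspection.
    print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once you are happy with the printed command.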
find-chimeras¶
Note
It’s recommended that you provide SAM files as input which are generated using align-for-chimeras
Todo
Provide support for detecting chimeras between same RNA types
Given two SAM files, this script tries to find chimeras that are observed between a smallRNA and a targetRNA
usage: An example usage is: find-chimeras -s smallRNA.sam -t targetRNA.sam -o output
- Options:
--smallRNA, -s Provide smallRNA alignment SAM file
--targetRNA, -t Provide targetRNA alignment SAM file
--getGenomicLocationsSmallRNA=False, -ggs=False Do you want genomic locations for small RNA?
--getGenomicLocationsTargetRNA=False, -ggt=False Do you want genomic locations for target RNA?
--smallRNAAnnotation, -sa Provide smallRNA annotation gtf (from Gencode) or gff3 (from Mirbase). Only provide gtf from Gencode or gff3 from Mirbase. Does not support other gtf files
--targetRNAAnnotation, -ta Provide targetRNA annotation gtf (from Gencode). Only provide gtf from Gencode. Does not support other gtf files
--output, -o The output name without extension (.bed .csv will be added)
--overlap=4, -ov=4 Maximum overlap to be set between two molecules when determining chimeras
--gap=9, -ga=9 Maximum gap (number of unaligned nucleotides) allowed between two molecules within a chimera
--logLevel=INFO, -l=INFO Set logging level
Possible choices: INFO, DEBUG, WARNING, ERROR
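To make the --overlap and --gap options more concrete, here is one possible reading of them as a rule on two alignment intervals within the same read. This is only an illustrative sketch under stated assumptions (0-based, end-exclusive read coordinates; the hypothetical function `is_chimera_candidate`); it is not taken from the package's source and the actual implementation may differ.

```python
def is_chimera_candidate(small, target, max_overlap=4, max_gap=9):
    """Check two read intervals against overlap and gap limits.

    `small` and `target` are (start, end) positions on the read,
    0-based and end-exclusive. Illustrative rule only, not the
    package's own implementation.
    """
    first, second = sorted([small, target])
    # Positive distance: unaligned nucleotides between the two parts.
    # Negative distance: nucleotides claimed by both alignments.
    distance = second[0] - first[1]
    if distance >= 0:
        return distance <= max_gap
    return -distance <= max_overlap


if __name__ == "__main__":
    print(is_chimera_candidate((0, 22), (20, 50)))  # 2 nt overlap
    print(is_chimera_candidate((0, 22), (35, 60)))  # 13 nt gap
```

With the default limits of 4 and 9, the first pair in the usage lines is within bounds and the second is not.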
Example¶
We will be using a dataset from a CLASH experiment, which is hosted here
In this instance, we’ll be using the first 4 million reads from the dataset. The sequential order to find chimeras on CLASH datasets using this package is the following:
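As an aside before the steps, a subset such as the first 4 million reads could be produced with a sketch like the one below. It assumes a gzipped FASTQ file with the common four-lines-per-record layout; `head_fastq` and the input name `E3.fastq.gz` are hypothetical.

```python
import gzip
from itertools import islice


def head_fastq(src, dst, n_reads):
    """Copy the first `n_reads` records of a gzipped FASTQ file to `dst`.

    Assumes four lines per record. Returns the number of lines written.
    """
    written = 0
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
        for line in islice(fin, 4 * n_reads):
            fout.write(line)
            written += 1
    return written


# Example call (hypothetical input file name):
# head_fastq("E3.fastq.gz", "E3_4M.fastq.gz", 4000000)
```

The function streams the file line by line, so memory use stays small regardless of the input size.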
Run download-for-chimeras¶
Run download-for-chimeras for the first time to download sequences and generate necessary indexes
The dataset that we are using here belongs to H. sapiens. The sequence database needs to be downloaded from Gencode and miRBase. Here’s how you can download it:
The command below assumes the default path ~/db/CLASHChimeras, but if you want a different folder for your sequences, please specify it using --path /path/to/your/folder as an argument. It’s highly recommended to get familiar with the arguments by typing download-for-chimeras -h
$ download-for-chimeras -gor "H.sapiens" -mor hsa
Note
It’s an interactive script which prompts for user input when selecting the release version.
Warning
Please be patient as this is a big download and index generation takes even longer
Warning
The latest Gencode release, once downloaded and with all indexes generated, takes around 11G of disk space
Below is an example of how download-for-chimeras runs.
Note
All the database files are already present in this example run, so they are verified by sha256sums. Thus, the timestamps are very close to each other. Actual download and generation of indexes will take a while
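For readers unfamiliar with checksum verification, the general idea can be sketched as follows. This is a generic SHA-256 check using Python's standard library, not the package's own code; `sha256_of` and `is_intact` are hypothetical names.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected_hex):
    """Compare a file's digest with a previously recorded one."""
    return sha256_of(path) == expected_hex.strip().lower()
```

A file whose digest matches the recorded value does not need to be downloaded or generated again, which is why a repeat run finishes quickly.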
Indexes¶
A series of bowtie2 and tophat indexes is generated after you’ve run the download-for-chimeras script. Assuming that you ran the command below and selected the latest versions of Gencode and miRBase, the following indexes will be generated automatically
$ download-for-chimeras -gor "H.sapiens" -mor hsa
smallRNA & targetRNA Indexes¶
These indexes can be used as --smallRNAIndex -si or --targetRNAIndex -ti in align-for-chimeras
Path for index | Index Type | RNA Type |
---|---|---|
~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.pc_transcripts | Bowtie2 | protein_coding |
~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.lncRNA_transcripts | Bowtie2 | lncRNA |
~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.snoRNA_transcripts | Bowtie2 | snoRNA |
~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.snRNA_transcripts | Bowtie2 | snRNA |
~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.tRNA_transcripts | Bowtie2 | tRNA |
~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.misc_RNA_transcripts | Bowtie2 | misc_RNA |
~/db/CLASHChimeras/Mirbase/21/hsa-hairpin | Bowtie2 | miRNA-hairpin |
~/db/CLASHChimeras/Mirbase/21/hsa-mature | Bowtie2 | miRNA-mature |
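The paths in this table appear to be bowtie2 index prefixes rather than single files; bowtie2-build writes several files per prefix, such as `<prefix>.1.bt2` (or `.1.bt2l` for large indexes). A small sketch for checking that an index is present before starting an alignment, with `bowtie2_index_exists` as a hypothetical helper name:

```python
import glob
import os


def bowtie2_index_exists(prefix):
    """Return True if a bowtie2 index file exists for `prefix`."""
    prefix = os.path.expanduser(prefix)
    # Matches <prefix>.1.bt2 as well as the large-index form <prefix>.1.bt2l
    return bool(glob.glob(glob.escape(prefix) + ".1.bt2*"))


if __name__ == "__main__":
    index = "~/db/CLASHChimeras/Mirbase/21/hsa-hairpin"
    print(index, "found" if bowtie2_index_exists(index) else "missing")
```

This only checks for the first index file, so it is a quick sanity check and not a full validation of the index.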
Genome-Index¶
This index should be provided if you run align-for-chimeras with --run tophat
Path for index | Type |
---|---|
~/db/CLASHChimeras/Gencode/H.sapiens/22/GRCh38.p2.genome | Bowtie2 |
Transcriptome-Index¶
This index should be provided, along with the Genome-Index, if you run align-for-chimeras with --run tophat
Path for index | Type |
---|---|
~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.chr_patch_hapl_scaff.annotation | tophat |
Annotation¶
Annotation File | RNA type |
---|---|
~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.chr_patch_hapl_scaff.annotation.gtf | protein_coding |
~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.chr_patch_hapl_scaff.annotation.gtf | lncRNA |
~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.chr_patch_hapl_scaff.annotation.gtf | snRNA |
~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.chr_patch_hapl_scaff.annotation.gtf | snoRNA |
~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.chr_patch_hapl_scaff.annotation.gtf | misc_RNA |
~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.tRNAs.gtf | tRNA |
~/db/CLASHChimeras/Mirbase/21/hsa.gff3 | miRNA |
Run align-for-chimeras¶
Note
Please refer to Indexes when selecting --smallRNAIndex -si or --targetRNAIndex -ti when you run align-for-chimeras
For this instance, we want to find the chimeras between miRNA and protein_coding from the raw reads. After you have successfully run download-for-chimeras and made sure that all the indexes are present for your alignment to begin, please use the following command
$ align-for-chimeras -i E3_4M.fastq.gz -gz -r bowtie2 -si ~/db/CLASHChimeras/Mirbase/21/hsa-hairpin -ti ~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.pc_transcripts -o E3-miRNA-pc
This is how it runs.
After the successful execution of align-for-chimeras, these are the files that are generated
- E3-miRNA-pc.smallRNA.sam
- E3-miRNA-pc.targetRNA.sam
Note
Please use --threads -n to specify the number of cores to use when executing Bowtie2
align-for-chimeras also provides an argument to run tophat. This helps to visualise the transcript coverage across the genome. Please use the following command to align to the whole genome
$ align-for-chimeras -i E3_4M.fastq.gz -gz -r tophat -gi ~/db/CLASHChimeras/Gencode/H.sapiens/22/GRCh38.p2.genome -tri ~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.chr_patch_hapl_scaff.annotation -o E3_4M
To create a bigWig file from the tophat output, I’d recommend using deepTools to create a normalized coverage file, as described on the following wiki page
Let’s move forward with finding chimeras between these RNA types
Run find-chimeras¶
Note
Please refer to Annotation when selecting --smallRNAAnnotation -sa or --targetRNAAnnotation -ta when you run find-chimeras
After running align-for-chimeras, it’s time to detect chimeras. Please make sure that you have the SAM files generated by align-for-chimeras, then use the following command
$ find-chimeras -s E3-miRNA-pc.smallRNA.sam -t E3-miRNA-pc.targetRNA.sam -ggs -sa ~/db/CLASHChimeras/Mirbase/21/hsa.gff3 -ggt -ta ~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.chr_patch_hapl_scaff.annotation.gtf -o E3-miRNA-pc
This is how the above command runs
After the successful execution of find-chimeras, these are the files that are generated
- E3-miRNA-pc.chimeras.tsv
- E3-miRNA-pc.smallRNA.bed
- E3-miRNA-pc.targetRNA.bed
Note
Please note that if you have not specified --getGenomicLocationsSmallRNA -ggs, <sample>.smallRNA.bed will not be generated. If you haven’t specified --getGenomicLocationsTargetRNA -ggt, <sample>.targetRNA.bed will not be generated.
You can view the chimeras in the generated <sample>.chimeras.tsv file. If you want to visualize the data in a genome browser, you can do that by loading <sample>.smallRNA.bed and <sample>.targetRNA.bed in IGV or your genome browser of choice.
Note
Please check the genome assembly version described in Genome-Index and make sure you have the same or corresponding version set in your genome browser
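Before loading a BED file into a browser, a quick per-chromosome summary can confirm that it contains what you expect. The sketch below assumes a standard tab-delimited BED layout with the chromosome name in the first column; `count_bed_intervals` is a hypothetical helper and has not been checked against the exact files written by find-chimeras.

```python
import csv
from collections import Counter


def count_bed_intervals(path):
    """Count BED records per chromosome (first column)."""
    counts = Counter()
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            # Skip blank lines and optional BED header lines.
            if not row or row[0].startswith(("#", "track", "browser")):
                continue
            counts[row[0]] += 1
    return counts


# Example call, using an output name from the run above:
# print(count_bed_intervals("E3-miRNA-pc.targetRNA.bed").most_common(5))
```

Chromosome names in the summary should match the naming used by the assembly loaded in your browser.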
Possible combinations¶
Because of the modular design of the software, it is possible to find chimeras between different types of RNA. Please refer to Indexes and run align-for-chimeras with the smallRNA and targetRNA of your choice.
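One way to organise several combinations is to enumerate the smallRNA and targetRNA index pairs up front and derive an output name for each. The following is a sketch only: `plan_runs` is a hypothetical helper, the short labels are made up for illustration, and which RNA types you treat as "small" or "target" is your own choice.

```python
from itertools import product

# Index prefixes taken from the Indexes tables; the labels are illustrative.
SMALL = {
    "miRNA": "~/db/CLASHChimeras/Mirbase/21/hsa-hairpin",
    "snoRNA": "~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.snoRNA_transcripts",
}
TARGET = {
    "pc": "~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.pc_transcripts",
    "lncRNA": "~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.lncRNA_transcripts",
}


def plan_runs(sample, small, target):
    """Return (output_name, small_index, target_index) for every pairing."""
    runs = []
    for (s_name, s_path), (t_name, t_path) in product(
            sorted(small.items()), sorted(target.items())):
        runs.append(("{}-{}-{}".format(sample, s_name, t_name), s_path, t_path))
    return runs


if __name__ == "__main__":
    for output, small_index, target_index in plan_runs("E3", SMALL, TARGET):
        print(output, small_index, target_index)
```

Each tuple supplies the values for -o, -si and -ti of one align-for-chimeras run.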
Visualisation in Genome Browser¶
This is an example visualization in IGV with the normalized coverage included as a track
Chimeras table¶
Here is the example chimeras table that is generated. The columns information can be found commented in the first lines
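If you want to process the table in a script, the commented column descriptions need to be separated from the data rows. The sketch below assumes that comment lines start with `#` and that data rows are tab-separated; both are assumptions about the file layout, and `read_chimeras` is a hypothetical helper.

```python
def read_chimeras(path, comment="#"):
    """Split a chimeras table into comment lines and tab-separated rows.

    Assumes comment lines begin with `comment` (an assumption, not
    confirmed against the package's output).
    """
    comments, rows = [], []
    with open(path, newline="") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line:
                continue
            if line.startswith(comment):
                comments.append(line.lstrip(comment).strip())
            else:
                rows.append(line.split("\t"))
    return comments, rows


# Example call, using an output name from the run above:
# header_notes, records = read_chimeras("E3-miRNA-pc.chimeras.tsv")
```

The returned comment lines can be printed once to see what each column means before working with the rows.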
Issues & Feedback¶
If you encounter any issues, please report them on the Issues page of the GitHub repository. Please feel free to offer your suggestions and feedback and contribute by submitting pull requests.