CLASHChimeras¶

CLASHChimeras is a Python package for analysing CLASH datasets. It takes raw fastq files as input, profiles the RNAs present in the reads and identifies chimeric reads. The output is written as CSV and BED files for easy visualization in genome browsers.

Installation¶

You can install it using pip once you have set up Python version 3.4 or above. Please use this guide for setting up Python if you have not done so already. After setting up Python and pip, run one of the following commands in your shell.

For local installation (Usually $HOME/.local):

$ pip3 install --user CLASHChimeras

For global installation (Usually /usr/):

Note

You should have sudo privileges

$ sudo pip3 install CLASHChimeras

Dependencies¶

Warning

These dependencies must be satisfied if you want to use align-for-chimeras

CLASHChimeras requires certain software to be installed and set up before you can use it fully. You need to install the following explicitly:

Bowtie2 - Fast and sensitive read alignment
Tophat - A spliced read mapper for RNA-Seq
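
Before running align-for-chimeras, it can help to confirm that both aligners are reachable. The following is a minimal Python sketch and not part of CLASHChimeras. It assumes the executables are named `bowtie2` and `tophat`, which may differ on your system; the helper names are made up for illustration.

```python
import shutil

# Assumed executable names; adjust them if your installation differs.
REQUIRED_TOOLS = ("bowtie2", "tophat")


def find_tools(names=REQUIRED_TOOLS):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(found):
    """Return the sorted names of tools that could not be located."""
    return sorted(name for name, path in found.items() if path is None)


if __name__ == "__main__":
    found = find_tools()
    for name, path in sorted(found.items()):
        print(name, "->", path or "not found on PATH")
```

If a tool is reported as missing but is installed elsewhere, its location can be passed with `--bowtieExecutable` or `--tophatExecutable`.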

Usage¶

The package can be used by three executable scripts:

download-for-chimeras
align-for-chimeras
find-chimeras

download-for-chimeras¶

Downloads the required sequences and creates the bowtie2 indexes required for alignment

usage: An example usage is: download-for-chimeras -gor "H.sapiens" -mor hsa

Options:

`--gencodeOrganism=H.sapiens, -gor=H.sapiens`
	Select model organism Possible choices: H.sapiens, M.musculus
`--mirbaseOrganism=hsa, -mor=hsa`
	Select organism to download microRNAs for
`--path=~/db/CLASHChimeras, -pa=~/db/CLASHChimeras`
	Location where all the database files and indexes will be downloaded
`--logLevel=INFO, -l=INFO`
	Set logging level Possible choices: INFO, DEBUG, WARNING, ERROR
`--bowtieExecutable, -be`
	Provide bowtie2 executable if it’s not present in your path
`--tophatExecutable, -te`
	Provide Tophat executable if it’s not present in your path
`--miRNA=hairpin, -mi=hairpin`
	Which miRNA sequences to align Possible choices: mature, hairpin
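
When the same download has to be repeated on several machines, it may be convenient to assemble the command line in a script. The Python sketch below only builds and prints the command; it uses just the flags and choices listed above. The function names `download_command` and `as_shell` are hypothetical helpers, not part of the package.

```python
import shlex

# Choices copied from the option list above.
GENCODE_CHOICES = ("H.sapiens", "M.musculus")
MIRNA_CHOICES = ("mature", "hairpin")


def download_command(gencode_organism="H.sapiens", mirbase_organism="hsa",
                     path=None, mirna=None):
    """Return the download-for-chimeras argument list for the given options."""
    if gencode_organism not in GENCODE_CHOICES:
        raise ValueError("unsupported Gencode organism: %r" % gencode_organism)
    if mirna is not None and mirna not in MIRNA_CHOICES:
        raise ValueError("unsupported miRNA type: %r" % mirna)
    cmd = ["download-for-chimeras", "-gor", gencode_organism,
           "-mor", mirbase_organism]
    if path is not None:
        cmd += ["--path", path]
    if mirna is not None:
        cmd += ["--miRNA", mirna]
    return cmd


def as_shell(cmd):
    """Quote every argument so the command can be pasted into a shell."""
    return " ".join(shlex.quote(part) for part in cmd)


print(as_shell(download_command()))
```

The sketch prints the command instead of executing it, so it can be pasted into a terminal and run there.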

align-for-chimeras¶

Warning

The input fastq is expected to be adapter trimmed and quality controlled

Note

Flexbar can be used to trim raw fastq sequences

Given a fastq file, this script executes bowtie2 and tophat aligners to generate alignment files necessary for detecting chimeras in the reads

usage: 
 align-for-chimeras -i input.fastq -si /path/to/smallRNA_index -ti /path/to/targetRNA_index -o output -r bowtie2 
 align-for-chimeras -i input.fastq -gi /path/to/genome_index -tri /path/to/transcriptome_index -o output -r tophat 
 
 
 To see detailed help, please run 
 align-for-chimeras -h

Options:

`--input, -i`	Input file containing reads fastq
`--smallRNAIndex, -si`
	Provide the smallRNA bowtie2 index (usually resides in ~/db/CLASHChimeras, or elsewhere if you specified --path -pa when running download-for-chimeras)
`--targetRNAIndex, -ti`
	Provide the targetRNA bowtie2 index (usually resides in ~/db/CLASHChimeras, or elsewhere if you specified --path -pa when running download-for-chimeras)
`--genomeIndex, -gi`
	Provide the genome bowtie2 index (usually resides in ~/db/CLASHChimeras, or elsewhere if you specified --path -pa when running download-for-chimeras)
`--transcriptomeIndex, -tri`
	Provide the transcriptome index as specified in tophat --transcriptome-index
`--output, -o`	The output name without extension (.sam .bam will be added)
`--run=bowtie2, -r=bowtie2`
	Run the following aligner for raw reads Possible choices: bowtie2, tophat
`--logLevel=INFO, -l=INFO`
	Set logging level Possible choices: INFO, DEBUG, WARNING, ERROR
`--gzip=False, -gz=False`
	Whether your input file is gzipped
`--bowtieExecutable, -be`
	Provide bowtie2 executable if it’s not present in your path
`--tophatExecutable, -te`
	Provide Tophat executable if it’s not present in your path
`--preset=sensitive-local, -p=sensitive-local`
	Provide preset for bowtie2 Possible choices: very-fast, fast, sensitive, very-sensitive, very-fast-local, fast-local, sensitive-local, very-sensitive-local
`--tophatPreset=very-sensitive, -tp=very-sensitive`
	Provide preset for Tophat Possible choices: very-fast, fast, sensitive, very-sensitive
`--mismatch=1, -m=1`
	Number of seed mismatches, represented in bowtie2 as -N. Possible choices: 0, 1
`--reverseComplement=False, -rc=False`
	Align to reverse complement of reference, represented in bowtie2 as --norc
`--unaligned=False, -un=False`
	Whether to keep unaligned reads in the output sam file. Represented in bowtie2 as --no-unal
`--threads=1, -n=1`
	Specify the number of threads
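
The two usage lines above imply that each aligner needs its own pair of indexes: `-si` and `-ti` for bowtie2, `-gi` and `-tri` for tophat. The Python sketch below builds the argument list and checks that pairing before anything is run. It is an illustration only: `align_command` is a hypothetical helper, and it assumes that `-gz` is a bare switch (as in the example later in this document) and that `-n` takes an integer value.

```python
def align_command(input_fastq, output, run="bowtie2",
                  small_rna_index=None, target_rna_index=None,
                  genome_index=None, transcriptome_index=None,
                  gzipped=False, threads=1):
    """Return the align-for-chimeras argument list for one aligner run."""
    cmd = ["align-for-chimeras", "-i", input_fastq, "-o", output, "-r", run]
    if run == "bowtie2":
        # bowtie2 mode aligns against the smallRNA and targetRNA indexes.
        if not (small_rna_index and target_rna_index):
            raise ValueError("bowtie2 run needs both -si and -ti indexes")
        cmd += ["-si", small_rna_index, "-ti", target_rna_index]
    elif run == "tophat":
        # tophat mode aligns against the genome and transcriptome indexes.
        if not (genome_index and transcriptome_index):
            raise ValueError("tophat run needs both -gi and -tri indexes")
        cmd += ["-gi", genome_index, "-tri", transcriptome_index]
    else:
        raise ValueError("run must be 'bowtie2' or 'tophat'")
    if gzipped:
        cmd.append("-gz")  # assumed to be a switch without a value
    if threads != 1:
        cmd += ["-n", str(threads)]  # assumed to take an integer value
    return cmd


print(align_command("input.fastq", "output",
                    small_rna_index="/path/to/smallRNA_index",
                    target_rna_index="/path/to/targetRNA_index"))
```

The returned list can be handed to `subprocess.run` once the indexes exist.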

find-chimeras¶

Note

It’s recommended that you provide SAM files as input which are generated using align-for-chimeras

Todo

Provide support for detecting chimeras between same RNA types

Given two SAM files, this script tries to find chimeras that are observed between a smallRNA and a targetRNA

usage: An example usage is: find-chimeras -s smallRNA.sam -t targetRNA.sam -o output

Options:

`--smallRNA, -s`	Provide smallRNA alignment SAM file
`--targetRNA, -t`
	Provide targetRNA alignment SAM file
`--getGenomicLocationsSmallRNA=False, -ggs=False`
	Do you want genomic locations for small RNA?
`--getGenomicLocationsTargetRNA=False, -ggt=False`
	Do you want genomic locations for target RNA?
`--smallRNAAnnotation, -sa`
	Provide smallRNA annotation gtf (from Gencode) or gff3 (from Mirbase). Other gtf files are not supported
`--targetRNAAnnotation, -ta`
	Provide targetRNA annotation gtf (from Gencode). Other gtf files are not supported
`--output, -o`	The output name without extension (.bed .csv will be added)
`--overlap=4, -ov=4`
	Maximum overlap allowed between two molecules when determining chimeras
`--gap=9, -ga=9`	Maximum gap (number of unaligned nucleotides) allowed between two molecules within a chimera
`--logLevel=INFO, -l=INFO`
	Set logging level Possible choices: INFO, DEBUG, WARNING, ERROR
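
To make the `--overlap` and `--gap` options more concrete, here is a small Python illustration of how two aligned segments of one read could be compared against such limits. This is not the algorithm used by find-chimeras, whose details are not described here. It assumes each segment is given as a half-open `(start, end)` interval in read coordinates, and the function names are hypothetical.

```python
def overlap_and_gap(first, second):
    """Return (overlap, gap) in nucleotides between two read segments.

    Each segment is a half-open (start, end) interval on the read.
    """
    (s1, e1), (s2, e2) = sorted([first, second])
    overlap = max(0, min(e1, e2) - s2)  # bases covered by both segments
    gap = max(0, s2 - e1)               # unaligned bases between them
    return overlap, gap


def within_limits(first, second, max_overlap=4, max_gap=9):
    """Check a segment pair against the default --overlap and --gap values."""
    overlap, gap = overlap_and_gap(first, second)
    return overlap <= max_overlap and gap <= max_gap


# A 22 nt segment followed by a segment starting 2 nt before it ends.
print(overlap_and_gap((0, 22), (20, 50)))
print(within_limits((0, 22), (20, 50)))
```

With the defaults, a pair overlapping by 2 nt passes, while a pair separated by 13 unaligned nucleotides or overlapping by 12 nt does not.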

Example¶

We will be using a dataset from a CLASH experiment, which is hosted here

In this instance, we’ll be using the first 4 million reads from the dataset. The sequential order to find chimeras on CLASH datasets using this package is the following:

Run `download-for-chimeras`¶

Run download-for-chimeras for the first time to download sequences and generate necessary indexes

The dataset that we are using here belongs to H. sapiens. The sequence database needs to be downloaded from Gencode and miRBase. Here’s how you can download it:

The code below assumes the default path ~/db/CLASHChimeras, but if you want to put your sequences in a different folder, please specify it using --path /path/to/your/folder as an argument. It’s highly recommended to familiarize yourself with the arguments by typing download-for-chimeras -h

$ download-for-chimeras -gor "H.sapiens" -mor hsa

Note

It’s an interactive script which prompts for user input when selecting the release version.

Warning

Please be patient as this is a big download and index generation takes even longer

Warning

Once the latest Gencode release is downloaded and all indexes are generated, it takes around 11G of disk space
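
Given the size quoted above, a quick free-space check before starting may save a failed run. The Python sketch below is one way to do it. It treats "11G" as 11 GiB, which is an assumption, and the helper names are made up for illustration.

```python
import os
import shutil

# The ~11G figure from the warning above, interpreted as GiB (assumption).
REQUIRED_BYTES = 11 * 1024 ** 3


def nearest_existing_dir(path):
    """Walk up from path until an existing directory is found."""
    path = os.path.abspath(os.path.expanduser(path))
    while not os.path.isdir(path):
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root
            break
        path = parent
    return path


def has_enough_space(path, required=REQUIRED_BYTES):
    """True if the filesystem that would hold path has `required` bytes free."""
    free = shutil.disk_usage(nearest_existing_dir(path)).free
    return free >= required


if __name__ == "__main__":
    print(has_enough_space("~/db/CLASHChimeras"))
```

The target folder may not exist yet, so the check walks up to the closest existing parent directory.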

Below is an example of how download-for-chimeras runs.

Note

All the database files are already present in this example run, so they are verified by sha256sums. Thus, the timestamps are very close to each other. Actual download and generation of indexes will take a while

Indexes¶

A series of bowtie2 and tophat indexes is generated after you’ve run the download-for-chimeras script. Assuming that you ran the command below and selected the latest versions of Gencode and miRBase, the following indexes will be generated automatically

$ download-for-chimeras -gor "H.sapiens" -mor hsa

smallRNA & targetRNA Indexes¶

These indexes can be used as --smallRNAIndex -si or --targetRNAIndex -ti in align-for-chimeras

Path for index	Index Type	RNA Type
~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.pc_transcripts	Bowtie2	protein_coding
~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.lncRNA_transcripts	Bowtie2	lncRNA
~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.snoRNA_transcripts	Bowtie2	snoRNA
~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.snRNA_transcripts	Bowtie2	snRNA
~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.tRNA_transcripts	Bowtie2	tRNA
~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.misc_RNA_transcripts	Bowtie2	misc_RNA
~/db/CLASHChimeras/Mirbase/21/hsa-hairpin	Bowtie2	miRNA-hairpin
~/db/CLASHChimeras/Mirbase/21/hsa-mature	Bowtie2	miRNA-mature
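
The paths in this table are index prefixes, not single files: bowtie2 stores an index as six files sharing that prefix, with the suffixes `.1.bt2`, `.2.bt2`, `.3.bt2`, `.4.bt2`, `.rev.1.bt2` and `.rev.2.bt2` (or `.bt2l` for large indexes). The Python sketch below checks whether a prefix has all six files, which can catch an interrupted index build. The function name is hypothetical, and the `exists` parameter is only there so the check can be tried without real files.

```python
import os

# File name parts that bowtie2 appends to an index prefix.
PARTS = (".1", ".2", ".3", ".4", ".rev.1", ".rev.2")
EXTENSIONS = (".bt2", ".bt2l")  # small and large index formats


def index_is_complete(prefix, exists=os.path.isfile):
    """True if all six bowtie2 index files exist for the given prefix."""
    prefix = os.path.expanduser(prefix)
    return any(
        all(exists(prefix + part + ext) for part in PARTS)
        for ext in EXTENSIONS
    )


if __name__ == "__main__":
    print(index_is_complete("~/db/CLASHChimeras/Mirbase/21/hsa-hairpin"))
```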

Genome-Index¶

This index should be provided if you run align-for-chimeras with --run tophat

Path for index	Type
~/db/CLASHChimeras/Gencode/H.sapiens/22/GRCh38.p2.genome	Bowtie2

Transcriptome-Index¶

This index should be provided if you run align-for-chimeras with --run tophat along with Genome-Index

Path for index	Type
~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.chr_patch_hapl_scaff.annotation	tophat

Annotation¶

Annotation File	RNA type
~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.chr_patch_hapl_scaff.annotation.gtf	protein_coding
~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.chr_patch_hapl_scaff.annotation.gtf	lncRNA
~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.chr_patch_hapl_scaff.annotation.gtf	snRNA
~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.chr_patch_hapl_scaff.annotation.gtf	snoRNA
~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.chr_patch_hapl_scaff.annotation.gtf	misc_RNA
~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.tRNAs.gtf	tRNA
~/db/CLASHChimeras/Mirbase/21/hsa.gff3	miRNA

Run `align-for-chimeras`¶

Note

Please refer to Indexes when selecting --smallRNAIndex -si or --targetRNAIndex -ti when you run align-for-chimeras

For this instance, we want to find the chimeras between miRNA and protein_coding from the raw reads. After you have successfully run download-for-chimeras and made sure that all the indexes are present for your alignment to begin, please use the following command

$ align-for-chimeras -i E3_4M.fastq.gz -gz -r bowtie2 -si ~/db/CLASHChimeras/Mirbase/21/hsa-hairpin -ti ~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.pc_transcripts -o E3-miRNA-pc

This is how it runs.

After the successful execution of align-for-chimeras, these are the files that are generated

E3-miRNA-pc.smallRNA.sam
E3-miRNA-pc.targetRNA.sam
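
A quick sanity check on these SAM files is to count how many records aligned. The Python sketch below relies only on the SAM format itself: header lines start with `@`, the second tab-separated field is the FLAG, and bit `0x4` marks an unmapped read. It counts alignment records, not unique reads, and the function name is made up for illustration.

```python
def count_alignments(lines):
    """Return (aligned, unaligned) record counts from SAM-formatted lines."""
    aligned = unaligned = 0
    for line in lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and blank lines
        flag = int(line.split("\t")[1])
        if flag & 0x4:  # 0x4 = segment unmapped
            unaligned += 1
        else:
            aligned += 1
    return aligned, unaligned


# Two made-up records: one mapped (flag 0) and one unmapped (flag 4).
example = [
    "@HD\tVN:1.0\tSO:unsorted\n",
    "read1\t0\tref1\t10\t42\t5M\t*\t0\t0\tACGTA\tIIIII\n",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTTT\tIIIII\n",
]
print(count_alignments(example))
```

For a real file, pass an open file handle instead of the example list, for instance `count_alignments(open("E3-miRNA-pc.smallRNA.sam"))`.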

Note

Please use --threads -n to specify the number of cores to use when executing Bowtie2

align-for-chimeras also provides an argument to run tophat. This helps to visualise the transcript coverage across the genome. Please use the following command to align to the whole genome

$ align-for-chimeras -i E3_4M.fastq.gz -gz -r tophat -gi ~/db/CLASHChimeras/Gencode/H.sapiens/22/GRCh38.p2.genome -tri ~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.chr_patch_hapl_scaff.annotation -o E3_4M

To create a bigWig file from the tophat output, I’d recommend using deepTools to create a normalized coverage file, as described on the following wiki page

Let’s move forward with finding chimeras between these RNA types

Run `find-chimeras`¶

Note

Please refer to Annotation when selecting --smallRNAAnnotation -sa or --targetRNAAnnotation -ta when you run find-chimeras

After running align-for-chimeras, it’s time to detect chimeras. Please make sure that you have the SAM files generated by align-for-chimeras, then use the following command

$ find-chimeras -s E3-miRNA-pc.smallRNA.sam -t E3-miRNA-pc.targetRNA.sam -ggs -sa ~/db/CLASHChimeras/Mirbase/21/hsa.gff3 -ggt -ta ~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22.chr_patch_hapl_scaff.annotation.gtf -o E3-miRNA-pc

This is how the above command runs

After the successful execution of find-chimeras, these are the files that are generated

E3-miRNA-pc.chimeras.tsv
E3-miRNA-pc.smallRNA.bed
E3-miRNA-pc.targetRNA.bed

Note

Please note if you have not specified --getGenomicLocationsSmallRNA -ggs, <sample>.smallRNA.bed will not be generated. If you haven’t specified --getGenomicLocationsTargetRNA -ggt, <sample>.targetRNA.bed will not be generated.

You can view the chimeras in the <sample>.chimeras.tsv file that is generated. If you want to visualize the data in a genome browser, add <sample>.smallRNA.bed and <sample>.targetRNA.bed as tracks in IGV or your genome browser of choice.
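
Before loading the BED files into a browser, it can be useful to see which chromosomes they cover. The Python sketch below counts intervals per chromosome using only the three mandatory BED columns (chrom, start, end). It assumes the files are tab-delimited; the function name and the sample lines are made up for illustration.

```python
from collections import Counter


def intervals_per_chromosome(lines):
    """Count BED intervals per chromosome, ignoring track/browser/comment lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.rstrip("\n").split("\t")[:3]
        if int(end) < int(start):
            raise ValueError("end before start in line: %r" % line)
        counts[chrom] += 1
    return counts


# Made-up intervals, not taken from a real CLASHChimeras run.
example = [
    "track name=example\n",
    "chr1\t100\t122\n",
    "chr1\t500\t530\n",
    "chrX\t42\t64\n",
]
print(dict(intervals_per_chromosome(example)))
```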

Note

Please check the genome assembly version described in Genome-Index and make sure you have the same or corresponding version set in your genome browser

Possible combinations¶

Because of the modular design of the software, it is possible to find chimeras between different types of RNA. Please refer to Indexes and run align-for-chimeras with the smallRNA and targetRNA of your choice.
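
As an illustration of such a combination, the Python sketch below picks two RNA types from the index table and builds the matching align-for-chimeras command. The paths are copied from the smallRNA & targetRNA Indexes table and assume Gencode release 22 and miRBase release 21 in the default location. `align_pair` is a hypothetical helper, and it rejects identical types because chimeras between the same RNA types are listed as a to-do above.

```python
GENCODE = "~/db/CLASHChimeras/Gencode/H.sapiens/22/gencode.v22."
MIRBASE = "~/db/CLASHChimeras/Mirbase/21/"

# RNA type -> bowtie2 index prefix, as listed in the index table.
INDEXES = {
    "protein_coding": GENCODE + "pc_transcripts",
    "lncRNA": GENCODE + "lncRNA_transcripts",
    "snoRNA": GENCODE + "snoRNA_transcripts",
    "snRNA": GENCODE + "snRNA_transcripts",
    "tRNA": GENCODE + "tRNA_transcripts",
    "misc_RNA": GENCODE + "misc_RNA_transcripts",
    "miRNA-hairpin": MIRBASE + "hsa-hairpin",
    "miRNA-mature": MIRBASE + "hsa-mature",
}


def align_pair(small, target, input_fastq, output):
    """Build a bowtie2-mode command for one smallRNA/targetRNA combination."""
    if small == target:
        raise ValueError("smallRNA and targetRNA types must differ")
    return ["align-for-chimeras", "-i", input_fastq, "-r", "bowtie2",
            "-si", INDEXES[small], "-ti", INDEXES[target], "-o", output]


print(" ".join(align_pair("snoRNA", "lncRNA", "reads.fastq", "snoRNA-lncRNA")))
```

The printed command keeps `~` unquoted so the shell can expand it when the line is pasted into a terminal.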

Visualisation in Genome Browser¶

This is an example visualization in IGV with the normalized coverage included as a track

Chimeras table¶

Here is the example chimeras table that is generated. The column descriptions can be found as comments in the first lines of the file
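
Because the column information sits in leading comment lines, a script reading the table has to separate those lines from the data rows. The Python sketch below does that under two assumptions that are not confirmed here: comment lines start with `#`, and the data is tab-separated (as the `.tsv` extension suggests). The function name and the sample content are made up and do not reflect the real columns.

```python
import io


def read_chimeras(handle, comment_prefix="#"):
    """Split a chimeras table into (comment lines, data rows)."""
    comments, rows = [], []
    for line in handle:
        text = line.rstrip("\n")
        if text.startswith(comment_prefix):
            comments.append(text)
        elif text.strip():
            rows.append(text.split("\t"))
    return comments, rows


# Placeholder content; the real file has its own columns and descriptions.
sample = io.StringIO(
    "# column 1: placeholder description\n"
    "# column 2: placeholder description\n"
    "valueA\tvalueB\n"
    "valueC\tvalueD\n"
)
comments, rows = read_chimeras(sample)
print(len(comments), "comment lines,", len(rows), "data rows")
```

For a real run, open `E3-miRNA-pc.chimeras.tsv` and pass the file handle in place of the sample.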

Issues & Feedback¶

If you encounter any issues, please report them on the Issues page of the GitHub repository. Please feel free to offer your suggestions and feedback and contribute by submitting pull requests.
