Introduction

This pipeline is developed for annotating and profiling small RNAs with UMI attached to both ends. The raw data are reads in FASTQ format. After preprocessing and mapping to reference libraries, we can statistic the reads. Particularly, the reads which can be aligned to the genome but cannot be aligned to RNA sequences databases will be filtered out and written to a CSV format file. The following figure presents a general workflow of the pipeline.

Methods

Preprocessing

The preprocessing phase includes three parts: adapter extraction, quality control, UMI extraction. At first, adapters in 3’ end and 5’ end are removed by Cutadapt and then only reads whose length within 27~55nt are retained. Finally, UMIs in both 3’ end and 5’ end are extracted from sequence reads and add to read names with UMI-tools. At the UMI extraction step, reads with poor quality are dropped at the same time by UMI-tools.

Software and parameters used in this phase:

Software	Parameter	Effect
Cutadapt	-j 40	use 40 CPU cores
	-a AGATCGGAAGAGCACACGTC	remove 3’ end adapter: AGATCGGAAGAGCACACGTC
	-g GTTCAGAGTTCTACAGTCCGACGATC	remove 5’ end adapter: GTTCAGAGTTCTACAGTCCGACGATC
	—overlap=3	require at least three bases match between adapter and read
	—error-rate=0.1	the level of error tolerance (mismatch) in adapter sequences searching
	—times=1	trim no more than one adapter from each reads
	—minimum-length=27 —maximum-length=55	drop reads shorter than 27nt or longer than 55 nt
UMI-tools extract	—bc-pattern=NNNNNNNNN —3prime	extract 9nt as UMI from 3’ end
	—bc-pattern=NNNNNN	extract 6nt as UMI from 5’ end
	—quality-filter-threshold 20	filter low quality reads

Mapping

The preprocessed reads are still in FASTQ format. The reads are aligned to genome and RNA databases with Bowtie, respectively. The following table lists the reference libraries we use in alignment. After aligning to references, the duplicate reads (reads with the same alignment position and UMI) will be dropped by umi_tools. After mapping, we can get the annotated reads in SAM format.

Genome	RNA databases(sort by mapping order)
hg19	mature tRNA
	tRNA
	rRNA
	miRNA
	piRNA
	Rfam

Software and parameters used in this phase:

Software	Parameter	Effect
Bowtie	-q	input format: FASTQ
	—threads 40	run in 40 threads
	-v 1	allow 1 mismatch
	-m 1	each reads can only be aligned once
	-a	keep all mapping results
	-un	write all reads that could not be aligned to a file
UMI-tools dedup	—method=unique	do not consider the sequencing mistakes in UMIs
	—read-length	duplicate reads must have the same length

Analysis

With the SAM format files in the mapping phase, we can count the reads. The reads with the same annotation will be counted as the same RNA. Pie graphs to visualize the percentage of different types of RNAs are plotted based on the counts. Besides counting the reads that mapping to the RNA databases, we can also filter the sequences that can be aligned to the genome but not to RNA databases. The statistics results are saved in y_rnadb_y_genome.csv and n_rnadb_y_genome.csv. The statistic and visualization are conducted with Python scripts.

Implement

The preprocessing phase and mapping phase are implemented by Shell scripts, and the analysis phase is implemented by Python scripts. The source code and more details of these scripts can be found in SmallRNASeq.

Full analysis results of samples can be found in the attachment.