NIHR Biomedical Research Centre for Mental Health: Computational Biology

cmpfastq

A simple Perl program that allows the user to compare QC-filtered FASTQ files

Download: [cmpfastq]
Written by: & .

Prior to alignment of next-generation sequencing data you should always perform quality control (QC) checks to ensure that the raw data looks good and that there are no problems which could reduce alignment accuracy or speed, or increase false positives when calling new variants (e.g. with FastQC: http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/). After initial QC checks it is often recommended to pre-process the FASTQ files using tools such as fastx (http://hannonlab.cshl.edu/fastx_toolkit/) to remove poor-quality reads, trim adapter sequences from reads, and remove sequencing artifacts, adapter-only reads, etc.

For paired-end data, next-generation sequencing machines produce two FASTQ files, one for each “end”, each containing multiple short-read sequences with quality information, e.g. for Illumina data:

  1. s_1_1_sequence.txt – lane 1 read/end number 1
  2. s_1_2_sequence.txt – lane 1 read/end number 2

When using tools such as fastx to QC filter your data, the two files are processed separately. This can often result in one read of a pair being filtered out while its mate is kept, leaving two files with unequal numbers of reads and un-matched pairs (“single-end reads”). Next-generation sequencing aligners typically do not allow for a mixture of paired-end and single-end reads in one alignment. Therefore, before aligning the QC-filtered data, the two processed “paired-end” files need to be compared to identify “paired” and “un-paired” reads. The files can then be processed as separate paired-end and single-end data and merged post alignment.
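
As a quick illustration of the problem described above, the following Python sketch counts the records in two filtered files and reports whether the counts differ. It is not part of cmpfastq; the function name `count_fastq_records` and the toy reads are hypothetical, and it assumes the common four-lines-per-record FASTQ layout with no empty sequence lines.

```python
import io

def count_fastq_records(handle):
    """Count records, assuming exactly four non-blank lines per FASTQ record."""
    n_lines = sum(1 for line in handle if line.strip())
    if n_lines % 4 != 0:
        raise ValueError("line count is not a multiple of 4; file may be truncated")
    return n_lines // 4

# Toy data (hypothetical): read r2 survived filtering in end 1 but not in end 2.
end1 = io.StringIO("@r1/1\nACGT\n+\nIIII\n@r2/1\nACGT\n+\nIIII\n")
end2 = io.StringIO("@r1/2\nTTGA\n+\nIIII\n")

n1 = count_fastq_records(end1)
n2 = count_fastq_records(end2)
files_differ = n1 != n2
print(n1, n2, files_differ)
```

Equal counts do not prove that the files are still properly paired, since each file may have lost different reads. The read identifiers themselves have to be compared.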

cmpfastq is a simple Perl program that allows the user to compare QC-filtered FASTQ files. It identifies “paired” and “un-paired” reads and writes out two result files for each filtered paired-end FASTQ file:

  1. <filename>-common.out – contains reads that are still paired post QC filtering
  2. <filename>-unique.out – contains reads that are no longer paired post QC filtering

The “unique.out” reads can then effectively be processed as “single-end” data.
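
The comparison described above can be sketched in Python as follows. This is only a minimal illustration of the idea, not cmpfastq's actual Perl implementation. The names `read_fastq` and `split_paired` are hypothetical, as is the assumption that mates share a read ID once a trailing `/1` or `/2` is removed. The sketch also holds both files in memory, which may not be practical for full-size lanes.

```python
import io

def read_fastq(handle):
    """Yield (read_id, record_text) for each four-line FASTQ record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # skip stray blank lines between records
        seq = handle.readline()
        plus = handle.readline()
        qual = handle.readline()
        if not qual:
            raise ValueError("truncated FASTQ record")
        # Assumed ID convention: first token of the header, minus a trailing /1 or /2.
        read_id = header[1:].split()[0]
        if read_id.endswith(("/1", "/2")):
            read_id = read_id[:-2]
        yield read_id, header + seq + plus + qual

def split_paired(handle1, handle2):
    """Split two filtered files into still-paired and no-longer-paired reads."""
    recs1 = dict(read_fastq(handle1))
    recs2 = dict(read_fastq(handle2))
    shared = recs1.keys() & recs2.keys()
    # Emit the common reads in file-1 order for both ends so mates stay in sync.
    common_ids = [rid for rid in recs1 if rid in shared]
    return {
        "common1": [recs1[rid] for rid in common_ids],
        "common2": [recs2[rid] for rid in common_ids],
        "unique1": [text for rid, text in recs1.items() if rid not in shared],
        "unique2": [text for rid, text in recs2.items() if rid not in shared],
    }

# Toy data (hypothetical): r2 was dropped from end 2, r4 was dropped from end 1.
end1 = io.StringIO(
    "@r1/1\nACGT\n+\nIIII\n@r2/1\nACGA\n+\nIIII\n@r3/1\nACGC\n+\nIIII\n"
)
end2 = io.StringIO(
    "@r1/2\nTTGA\n+\nIIII\n@r3/2\nTTGC\n+\nIIII\n@r4/2\nTTGG\n+\nIIII\n"
)
result = split_paired(end1, end2)
for name, records in result.items():
    print(name, [rec.split("\n", 1)[0] for rec in records])
```

In this sketch the two "common" lists correspond to the reads that would go on to paired-end alignment, and the two "unique" lists to the reads that would be aligned as single-end data.
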
Usage:

cmpfastq s_1_1_fastx.QC.fastq s_1_2_fastx.QC.fastq

Output: