FastQC for large numbers of samples

Suppose you’re analyzing exome data, all the way from FASTQ files as your starting point. Guides such as the how-to at seqanswers will suggest that somewhere near the beginning of your pipeline you should do some quality control on the FASTQ files themselves. Of course, after you align your reads to a genome you will look at quality control metrics such as coverage, but this is more fundamental than that– this is looking at the quality of the reads themselves that are coming off of your sequencer. The most widely used tool for this is FastQC. It’s pretty easy to use and there is a nice instructional video which helps explain how to interpret the results:

FastQC runs in under an hour on a ~2GB .fq.gz file (you don’t need to unzip before running) and outputs an HTML file with graphs and images showing the quality of your data. Graphically it’s pretty slick, but if you have hundreds and hundreds of samples, you’re not going to open up HTML page and pore over the results with your own two eyes. You need some way of looking at these data in aggregate.

Luckily FastQC also outputs a plain text version of its results, which, although not quite database-ready out of the box, is easy to convert. Today I wrote a Python script to walk a directory, find all the FastQC output files and scrape the information into a PostgreSQL database. Because the schema of results is different for every test that FastQC runs and I didn’t want to create 10 different tables, I kept it simple and just read all the details into a couple of plain text columns.

import os
import sys
import psycopg2

conn = psycopg2.connect(database="mydb", user="postgres", password="password")
c = conn.cursor()

create_table_sql = """
drop table if exists fastqc_summary;
create table fastqc_summary (
fileid varchar,
module varchar,
status varchar
);
drop table if exists fastqc_details;
create table fastqc_details (
id serial primary key,
fileid varchar,
module varchar,
col1 varchar,
ocols varchar
);
"""

c.execute(create_table_sql)
conn.commit()

for root, dirs, files in os.walk("c:/your/dir/here/"): # walk a directory containing FastQC output for multiple samples
    for name in files:
        if (name == "fastqc_data.txt"):
            fileid = name # use string slicing here if you only want part of the filename as the id
            with open(os.path.join(root,name),"r") as f: # automatically close the file when done
                for line in f.readlines():
                    line = line.strip() # trim endline
                    if (line[:2] == ">>" and line[:12] != ">>END_MODULE"): # for each module grab summary data
                        module = line[2:-5] # grab module name
                        status = line[-4:] # and overall status pass/warn/fail
                        sql = "insert into fastqc_summary(fileid,module,status) values(%s,%s,%s);"
                        data = (fileid,module,status)
                        c.execute(sql,data)
                    elif (line[:2] != ">>" and line[:2] != "##"): # grab details under each module
                        cols = line.split("\t")
                        col1 = cols[0]
                        ocols = "|".join(cols[1:])
                        sql = "insert into fastqc_details(fileid,module,col1,ocols) values(%s,%s,%s,%s);"
                        data = (fileid,module,col1,ocols)
                        c.execute(sql,data)

conn.commit()
c.close()
conn.close()

Once you’ve got this data read into PostgreSQL, it’s easy to figure out what (if anything) is wrong with your data. Here are a few SQL queries I found useful as a starting point:

-- how many modules passed, warned and failed
select status, count(*) n from fastqc_summary group by status;

-- see which modules failed
select * from fastqc_summary where status = 'fail';

-- get all details for failed modules
select   *
from     fastqc_summary s, fastqc_details d
where    s.fileid = d.fileid
and      s.module = d.module
and      s.status = 'fail'
order by id;

-- see which modules failed in the most samples
select module, count(*) n from fastqc_summary where status = 'fail' group by module order by module;

-- see which modules warned in the most samples
select module, count(*) n from fastqc_summary where status = 'warn' group by module order by module;

-- see which samples had one or more failed modules
select fileid, count(*) n from fastqc_summary where status = 'fail' group by fileid order by fileid;

-- see which samples had one or more modules with failure or warning
select fileid, count(*) n from fastqc_summary where status in('fail','warn') group by fileid order by fileid;

-- see quantity of data for each sample
select * from fastqc_details where col1 = 'Total Sequences' order by ocols;

-- see GC content for each sample
select * from fastqc_details where col1 = '%GC' order by ocols;