List of FDA-approved drugs and CNS drugs with SMILES

short story

Here is a table listing FDA-approved drugs (1691) and 3 columns:

generic_name: Generic name from DrugBank
cns_drug: TRUE/FALSE, based on whether any of 5 insurance plans I checked had the drug listed under “Central Nervous System agents” or similar
smiles: isomeric SMILES from DrugBank. When this field is empty it’s usually because the drug is a large biomolecule.

Alternately here is just a straight list of FDA-approved CNS drugs (221).

These lists are current as of October 4, 2013 but cannot be guaranteed accurate or exhaustive; use at your own risk. For details on how and why I created these, read on.

motivation

Following up on my recent foray into cheminformatics, I thought it would be nice to have simple list of FDA-approved drugs. This would be nice for plotting the physicochemical properties of molecules of interest to see how they overlap with the space of drugs, while also being able to spot-check any individual drug and ask “what is that?”. Since I’m interested in the central nervous system, it would be especially nice to have drugs flagged with a TRUE/FALSE as to whether they’re CNS drugs, since the need for blood-brain barrier permeability limits these to a much smaller chemical space.

Any other chemical properties I need later I can always get from the SMILES using rcdk.

first try

Surely some simple list like this already exists on the internet, but it’s been surprisingly difficult to find.

When I Googled “list of FDA-approved drugs” at first I only found hits like Drugs@FDA (searchable but not downloadable and no chemical information), FDA’s NDC directory (downloadable database with regulatory info and common names but no chemical info; also it lists substances such as “WATER”), Medilexicon’s Drug Search (no chemical data and all human-readable HTML, no machine-readable data). ZINC, meanwhile, has lists of FDA-approved drugs with SMILES and chemical data from the MSDI US Drugs Collection and the EPA’s DSSTOX toxicology study of FDA-approved drugs. But these, conversely, don’t have common names.

Finally I looked at the EPA’s own DSSTOX webpage and downloaded this file. Finally: this file has both SMILES and common names. But I was a bit disturbed that it doesn’t seem to match the data from the same EPA study as housed at ZINC:

options(stringsAsFactors=FALSE)

# from http://zinc.docking.org/catalogs/fda
zinc_fda_prop = read.table('c:/sci/037cinf/data/EPA-DSSTOX/fda_prop.xls',header=TRUE,sep='\t',quote="",comment.char="")
# unzipped from http://www.epa.gov/ncct/dsstox/StructureDataFiles/FDAMDD_DownloadFiles/FDAMDD_v3b_1216_15Feb2008.zip and exported to plain text
fda_annotations = read.table('c:/sci/037cinf/data/EPA-DSSTOX/FDAMDD_v3b_1216_15Feb2008_nostructures.txt',header=TRUE,sep='\t',quote="",comment.char="")

dim(fda_annotations) # 1216 34
dim(zinc_fda_prop) # 3180 14

# can we join on SMILES?
sum(fda_annotations$STRUCTURE_SMILES %in% zinc_fda_prop$SMILES) # 2
sum(fda_annotations$STRUCTURE_Parent_SMILES %in% zinc_fda_prop$SMILES) # 4
# not a case issue:
sum(toupper(fda_annotations$STRUCTURE_Parent_SMILES) %in% toupper(zinc_fda_prop$SMILES)) # 4

# can we join on molecular weight?
sum(fda_annotations$STRUCTURE_MolecularWeight %in% zinc_fda_prop$MWT) #2
# ZINC rounds to 3 decimal places, EPA to only 4. 
sum(round(fda_annotations$STRUCTURE_MolecularWeight,3) %in% 
      round(zinc_fda_prop$MWT,3)) # 24 matches when rounded to 3 decimal places
sum(round(fda_annotations$STRUCTURE_MolecularWeight,2) %in% 
      round(zinc_fda_prop$MWT,2)) # 302 matches when rounded to 2 decimal places
sum(round(fda_annotations$STRUCTURE_MolecularWeight,1) %in% 
      round(zinc_fda_prop$MWT,1)) # 768 matches when rounded to 1 decimal place

# but rounding to 1 decimal place means the match is no longer unique b/c 860 < 1216
length(unique(round(fda_annotations$STRUCTURE_MolecularWeight,1))) # 860

The two datasets are of different length (1216 vs. 3180 molecules). Almost none of the SMILES match exactly. I casted around for a different field to join on, but the only other attribute shared between the two tables was molecular weight. This gave very few matches, which might be a rounding issue but as I rounded to fewer and fewer decimal places I lost the one-to-one match before I got complete matching.

I spot-checked the first couple items in the list I downloaded from EPA, Acemetacin and Acenocoumarol, and the molecular weights listed on Wikipedia indeed matched what’s in the EPA’s table. Acenocoumarol’s molecular weight is 353.326, so I sorted the ZINC version of the table to find it, and there appeared to be two copies of it, ZINC00538563 and ZINC00538577, neither of which turned up in ZINC’s own search bar, and Googling them wasn’t helpful. Maybe the duplicate entries are the reason ZINC had so many more rows than the EPA’s version.

So I decided to start from the EPA’s version of the table. Since this contained common names, I figured that was a good place to start casting around for a list of CNS drug. The EPA table does have a TherapeuticCategory field, but there are 319 distinct values – too many to go through and manually pick out which ones are CNS-related. So when I Googled “list of CNS drugs” the most helpful thing I found was a PDF of drugs covered under United Health Care insurance, with CNS drugs listed on pp 13-19. This seemed like a good place to start. I copied and pasted the text on pp 13-19 into a text file, read it into R, collapsed it to a single string and then grepped against it:

# https://www.unitedhealthcareonline.com/ccmcontent/ProviderII/UHC/en-US/Assets/ProviderStaticFiles/ProviderStaticFilesPdf/Tools%20and%20Resources/Pharmacy%20Resources/PDL_Phys_Bk.pdf
uhc_cns_drug_list = read.table('c:/sci/037cinf/data/uhc-cns-drugs-pdf-list.txt',header=FALSE,sep="|") # read file, choose a delimiter absent from the file
cns_drug_string = paste(uhc_cns_drug_list$V1, sep=" ", collapse=" ") # collapse it into a single huge string
cns_drug_string = tolower(cns_drug_string) # convert to lowercase

fda_annotations$cns_drug = FALSE # initialize a column for whether a cns drug or not
for (i in 1:dim(fda_annotations)[1]) { # loop through b/c grep doesn't support vector operations.
  fda_annotations$cns_drug[i] = length(grep(tolower(fda_annotations$TestSubstance_ChemicalName[i]),cns_drug_string)) > 0 
}
which(fda_annotations$cns_drug) # list of indices
fda_annotations$TestSubstance_ChemicalName[fda_annotations$cns_drug] # list of names

I got 122 hits – not bad! A paper I had run across discussed only 74 [Pajouhesh & Lenz 2005], so I figured I might be in the right ballpark. But on spot-checking it, I noticed that aripiprazole (Abilify) was conspicuously absent – hey, it’s only the best-selling drug in the United States as of 2013. On further investigation, it turns out Abilify is covered by UHC (see p. 19), it’s just not in the EPA’s list to begin with.

second try

Obviously I needed a more exhaustive drug list to start with. At this point I became aware of DrugBank, which strives to be exactly the resource I was looking for to begin with. (Note: although DrugBank is Canadian, according to About its definition of “Approved” is FDA-approved). DrugBank is incredibly exhaustive – so much so that it ends up not being formatted in a user-friendly way. And even DrugBank does not have any easy machine-readable way of determining whether something is a CNS drug. So I had two tasks: to distill DrugBank down to a few columns, and to figure out which ones were CNS drugs.

First, I downloaded the “DrugCard” formatted version of the DrugBank database and used command line tools to extract the generic name, list of brand names, isomeric SMILES string, “Drug_Type” (approved vs. experimental, with some other designations as well) and indication (a free text string I didn’t end up using):

wget http://www.drugbank.ca/system/downloads/current/drugbank.txt.zip
unzip drugbank.txt.zip
cat drugbank.txt | grep -A 2 Generic_Name | awk 'NR%4 == 2 {print $0}' > generic_names.txt
cat drugbank.txt | grep -A 4 ^"# Brand_Names" | tr '\n' ',' | sed 's/--/\n/g' | sed 's/# Brand_Names://' | sed 's/(.*//' | sed 's/#.*//' > brand_names.txt
cat drugbank.txt | grep -A 2 ^"# Smiles_String_isomeric" | awk 'NR%4 == 2 {print $0}' | sed 's/Not Available//' > smiles.txt
cat drugbank.txt | grep -A 2 ^"# Drug_Type" | awk 'NR%4 == 2 {print $0}' > first_group.txt
cat drugbank.txt | grep -A 2 ^"# Indication" | awk 'NR%4 == 2 {print $0}' > indications.txt
paste -d '|' generic_names.txt brand_names.txt first_group.txt smiles.txt indications.txt > drugbank.5col.pipe

A few notes on some command line tricks above. awk 'NR%4 ==2' processes only the third line (0, 1, 2, 3) in each 4-line grep hit. paste is like a sideways cat. In the brand names line, grep -A 4 reads 4 lines trailing after the grep hit, in order to grab up to 4 different brand names. tr replaces the newlines with commas to comma-separate the ≤ 4 resulting hits (see here). sed 's/--/\n/g' replaces the -- which separates each grep hit with a newline, thus restoring one line per drug. The remaining sed commands just remove extraneous text (see here).

With that, I was done with the command line and read the table into R.

drugbank = read.table('c:/sci/037cinf/data/drugbank.txt/drugbank.5col.pipe',sep='|',header=FALSE,quote="",comment.char="")
colnames(drugbank) = c("generic_name","brand_names","first_group","smiles","indication")

Next I decided I wanted a more exhaustive list of CNS drugs than whatever United Health Care happens to cover. I Googled “list of central nervous system prescription drugs covered by insurance” and managed to find five insurance plans that gave exhaustive categorized lists of the drugs they cover, including a “Central Nervous System agents” category or similar: BCBS Texas, CareMark, Health Alliance, Humana Enhanced (I arbitrarily chose Wisconsin 2014), plus the original United Health Care document (all downloaded on October 4, 2013). The Humana document is exceptionally comprehensive, with 31 pages (pp 64-94) devoted exclusively to listing CNS drugs in 5 different tiers including “non-preferred” drugs.

I copied and pasted the CNS drugs list from each PDF into a text file and collapsed them into one enormous string in R:

# in read.table, choose a delimiter absent from the file such as | in order to just read all as one column.
uhc_cns_drug_list      = read.table('c:/sci/037cinf/data/uhc-cns-drugs-pdf-list.txt',header=FALSE,sep="|") # # https://www.unitedhealthcareonline.com/ccmcontent/ProviderII/UHC/en-US/Assets/ProviderStaticFiles/ProviderStaticFilesPdf/Tools%20and%20Resources/Pharmacy%20Resources/PDL_Phys_Bk.pdf 
humana_cns_drug_list   = read.table('c:/sci/037cinf/data/humana_2014_wi.txt',header=FALSE,sep="|") # http://apps.humana.com/marketing/documents.asp?file=2168465
ha_cns_drug_list       = read.table('c:/sci/037cinf/data/healthalliance-2013.txt',header=FALSE,sep="|") # https://www.healthalliance.org/media/HAMP_formulary.pdf%E2%80%8E
bcbst_cns_drug_list    = read.table('c:/sci/037cinf/data/bcbst-2014.txt',header=FALSE,sep="|") # http://www.bcbstx.com/pdf/drug_list.pdf
caremark_cns_drug_list = read.table('c:/sci/037cinf/data/caremark-oct2013.txt',header=FALSE,sep="|")  # http://www.caremark.com/portal/asset/caremark_drug_list.pdf
all_cns_drug_lists = rbind(uhc_cns_drug_list, humana_cns_drug_list, ha_cns_drug_list, bcbst_cns_drug_list, caremark_cns_drug_list) # concatenate all together
cns_drug_string = paste(all_cns_drug_lists$V1, sep=" ", collapse=" ") # collapse it into a single huge string
cns_drug_string = tolower(cns_drug_string) # convert to lowercase

To check my DrugBank list against the insurance plan coverage lists, I needed a few tricks. grep with \\b does whole-word matching – otherwise, “iron” matches “buspirone”. Non-alphanumeric characters in brand names appeared extraneous, so I removed them with gsub("[^[:alnum:] ]", "", str). Braces, brackets and parentheses in generic names always seemed to occur in IUPAC chemical names which would never be in the insurance plan list anyway, so I just skipped those.

drugbank$cns_drug = FALSE # initialize a column for whether a cns drug or not
for (i in 1:dim(drugbank)[1]) { # loop all drugs.  grep doesn't support vector operations.
  # check for generic names
  generic_name_present = FALSE # initialize
  # some generic names are chemical names with []{}() characters which mess up regexes
  # and obviously aren't called by that name in any insurance plans anyway, so skip those
  if(!length(grep("\\}|\\{|\\[|\\]|\\(|\\)",drugbank$generic_name[i])) > 0) {
    # if grep finds a match between the DrugBank generic name and the cns drug list, set TRUE
    generic_name_present = length(grep(paste("\\b",tolower(drugbank$generic_name[i]),"\\b",sep=""),cns_drug_string)) > 0 
  }
  # check for brand names
  brand_name_present = FALSE # initizalize
  brand_name_vector = strsplit(drugbank$brand_names[i],split=',')[[1]] # split the list of brand names on comma
  # remove empty and "Not Available" entries
  brand_name_vector = brand_name_vector[-which(brand_name_vector=="" | brand_name_vector=="Not Available")]
  # loop through all brand names
  for (brand_name in brand_name_vector) { 
    # some brand names in DrugBank are listed with characters like [] that mess up regexes, so:
    brand_name_alphanumeric = gsub("[^[:alnum:] ]", "", brand_name) # remove nonalphanumeric characters
    # now check if it's in the CNS drug list string
    if (length(grep(paste("\\b",tolower(brand_name_alphanumeric),"\\b",sep=""),cns_drug_string)) > 0) {
      brand_name_present = TRUE
    }
  }
  # either a generic or brand name hit counts for setting cns_drug to TRUE
  drugbank$cns_drug[i] = generic_name_present | brand_name_present
}

which(drugbank$cns_drug) # list of indices
drugbank$generic_name[drugbank$cns_drug & drugbank$first_group == 'Approved'] # list of names

Casting a wider net for insurance plans paid off – I got 221 FDA-approved CNS drugs when grepping against all five insurance plans, compared to 155 with only UHC. And of course, the more exhaustive DrugBank list paid off too – aripiprazole is on the new list, and I’ve got vastly more than the original 122 hits.

I certainly can’t make any guarantees about this new list. I think it’s at least pretty decent because with a bit of spot-checking I haven’t yet been able to think of a CNS drug I know of that isn’t on the list, nor have I been able to find an item on the list that isn’t a CNS drug.

I wrote the lists out to files:

drug_list = drugbank[drugbank$first_group == 'Approved', c('generic_name','cns_drug','smiles')]
drug_list_ordered = drug_list[with(drug_list,order(generic_name)),]
dim(drug_list_ordered) # 1691 3
write.table(drug_list_ordered,'drugs.txt',sep='\t',row.names=FALSE,col.names=TRUE,quote=FALSE)

cns_list = sort(drugbank$generic_name[drugbank$cns_drug & drugbank$first_group=='Approved'])
write.table(cns_list,'cnsdrugs.txt',row.names=FALSE,col.names=FALSE,quote=FALSE)

And then uploaded to this blog post. Here they are again: drugs.txt cnsdrugs.txt