New: Release 4 File Series - May 2012
DTP Releases (December 2010), 2D/3D, with GUSAR Human Liver Microsomal Stability Prediction Data Added
This is the first in a series of files that will be released over the next few months. These files will contain a successively curated structure set of all records of the Open NCI Database. This first file is based on the version of the Open NCI Database provided by DTP in December 2010 (2D coordinates SD file with 266,151 records). The file was processed as follows:
- The originally provided data fields "Release", "Structure Source" and "Structure Evaluation" were preserved.
- All name fields of the original file were merged into one data field ("DTP names").
- Addition of hydrogen atoms was performed by CACTVS.
- 3D Atom coordinates have been calculated by CORINA (if the calculation failed, 2D coordinates were calculated by CACTVS).
- Data fields "Formula" and "Molecular Weight" were added (calculated by CACTVS).
- The IUPAC Structure Identifiers "Standard InChI" and "Standard InChIKey" (Version 1.04) were included as data fields.
- NCI/CADD's Structure Identifiers "FICTS", "FICuS", and "uuuuu" were calculated and added as data fields.
- The number of potential stereo centers on atoms and/or bonds has been included as data fields "Number of atom stereocenters" and "Number of bond stereo centers"; the additional boolean field "Full atom and bond stereo specification" indicates whether the full relative stereo configuration is available for the corresponding structure record (this field is missing if no stereo centers are present).
This succeeded for 265,242 of the 266,151 original structure records.
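The data fields listed above are stored as SD file data items, so they can be read without a cheminformatics toolkit. The following is a minimal sketch, assuming the usual SD file layout in which each data item is a `> <Field name>` header line followed by value lines and a blank line; the function name `parse_sd_fields` is hypothetical and not part of any NCI tool.

```python
def parse_sd_fields(record: str) -> dict:
    """Return the data fields of one SD record as {field name: value}."""
    fields = {}
    lines = record.splitlines()

    # Skip the molfile block if present: data items follow the "M  END" line.
    i = 0
    for idx, line in enumerate(lines):
        if line.startswith("M  END"):
            i = idx + 1
            break

    while i < len(lines):
        line = lines[i]
        # A data header starts with '>' and carries the field name in <...>.
        if line.startswith(">") and "<" in line and ">" in line[line.index("<"):]:
            start = line.index("<") + 1
            name = line[start:line.index(">", start)]
            i += 1
            value_lines = []
            # The value runs until the first blank line or the record terminator.
            while i < len(lines) and lines[i].strip() and lines[i].strip() != "$$$$":
                value_lines.append(lines[i])
                i += 1
            fields[name] = "\n".join(value_lines)
        else:
            i += 1
    return fields
```

A record parsed this way can be queried by the field names given in the list, for example `parse_sd_fields(record).get("Standard InChIKey")`. Multi-line values, such as the merged "DTP names" field may contain, are joined with newlines.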
GUSAR QSAR Model Application for the prediction of human liver microsomal stability. Thirty-five QSAR models created by GUSAR were used to generate a consensus prediction of the microsomal stability of the chemical structures contained in this file. Each compound in the file is classified as stable or unstable (data field "GUSAR Human Liver Microsomal Stability Prediction"). The prediction output also includes an assessment of the applicability domain as provided by GUSAR (data field "GUSAR Human Liver Microsomal Stability Prediction AD"). This succeeded for 196,460 of the 265,242 structure records.
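Because a prediction is available for only part of the file, it can help to tally the prediction field and to keep the records without it as a separate group. The sketch below assumes each record has already been reduced to a dictionary of its data fields; the helper names and the label "(no prediction)" are hypothetical, and the exact value strings stored in the field are not specified here, so they are counted as found.

```python
from collections import Counter

# Field name as given in the file description above.
FIELD = "GUSAR Human Liver Microsomal Stability Prediction"


def tally_predictions(records, field=FIELD):
    """Count the values of the prediction field over an iterable of field dicts."""
    counts = Counter()
    for fields in records:
        value = fields.get(field)
        # Records for which the prediction did not succeed lack the field.
        key = value.strip() if value is not None else "(no prediction)"
        counts[key] += 1
    return counts


def coverage(succeeded: int, total: int) -> float:
    """Fraction of records for which a processing step succeeded."""
    return succeeded / total


# Figures quoted in the text: 196,460 predictions for 265,242 records.
print(f"{coverage(196_460, 265_242):.1%}")  # about 74.1%
```

By this arithmetic, roughly a quarter of the records carry no GUSAR prediction, so any filter on the field should decide explicitly how to treat missing values.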
This version of the NCI Open Database, which adds ~15,000 new structures, is not included in our Enhanced NCI Database Browser web service. We are also aware that, beyond that, the PubChem version of the NCI database contains ~15,000 additional structure records. We are currently analyzing the overlap between both sources.
265,242 structures in SDF format. This is a 198 MB gzipped file that uncompresses to about 1.2 GB.
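Since the file expands to more than a gigabyte, it may be more convenient to stream it from the gzip archive than to uncompress it first. The following is a minimal sketch using only the Python standard library; it relies on the SD file convention that each record ends with a `$$$$` line. The function names and the `latin-1` text encoding are assumptions, not properties documented for this file.

```python
import gzip
from typing import Iterable, Iterator


def iter_sd_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield one SD record at a time from an iterable of text lines."""
    record = []
    for line in lines:
        if line.rstrip("\r\n") == "$$$$":
            # End of the current record: emit it without the terminator line.
            yield "".join(record)
            record = []
        else:
            record.append(line)
    # Trailing text without a "$$$$" terminator is ignored.


def count_records(path: str) -> int:
    """Count the records in a gzipped SD file without uncompressing it to disk."""
    # The encoding is an assumption; latin-1 decodes any byte sequence.
    with gzip.open(path, "rt", encoding="latin-1") as handle:
        return sum(1 for _ in iter_sd_records(handle))
```

Called on the downloaded archive, `count_records` would be expected to agree with the record count stated above; memory use stays bounded by the size of the largest single record.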
Download
Release 3 Files - September 2003
September 2003 SD File of Combined DTP Releases, 2D/3D, with Canonical Properties Added
The most complete collection of Open NCI Database compounds as of September 2003 that we are aware of. These are 260,071 structures, combined from DTP releases from Oct. 1999, Aug. 2000, Feb. 2003, and Sep. 2003. All the identifier-type information that we were able to associate with the structures is included in this file: NSC numbers; DTP names for ~53,000 records (including some WLN strings); Unique SMILES, calculated by CACTVS according to Daylight's original (1989) canonicalization rules; the new IUPAC/NIST InChI chemical identifier (calculated with [beta] version 0.932 of NIST's program); IUPAC names, calculated with ACD/Labs' program ACD/Name Batch; eight different CACTVS hash codes, including a tautomer-invariant but stereochemistry-, multifragment-, charge- and isotope-sensitive hash code that is essentially a unique, calculable identifier for any (small-molecule) chemical. Additional properties, some of them helpful to categorize structures when dealing with several databases simultaneously, are explained in the Technical Notes.
The 2003 DTP releases now have many structures with at least some, if not full, stereochemistry specification. This allowed 3D coordinates of reliable stereoisomers to be calculated in many cases. Where such 3D structures would have potentially shown the wrong chemical, or would otherwise have been doubtful, 2D coordinates were kept. See the Technical Notes for more details. Also be aware of the fact that for a very large number of entries (on the order of 100,000), the structure shown in the 2003 DTP releases is slightly different from that shown in previous releases. In the vast majority of those cases, the structure is now represented as a different tautomer.
This version of the NCI Open Database, which adds ~10,000 new structures, is not included in our Enhanced NCI Database Browser web service.
260,071 structures in SDF format. This is a 214 MB gzipped file that uncompresses to about 1.6 GB.
Download
Release 2 Files - August 2000
August 2000 2D File
The "raw" structure data that were used to build Release 2 of the Enhanced NCI Database Browser. These are 250,251 2D structures calculated with CACTVS. Attention: Stereochemistry was assigned by CACTVS according to default rules due to the lack of stereochemical information in the original NCI data. The SMILES string and the CAS RN (where available) are also included for each structure.
250,251 structures in SDF format. This is a 90 MB gzipped file that uncompresses to about 982 MB.
Download
New in August 2006: A 3D version of the 0D file with some properties added. Their values are the same as those shown in the Enhanced NCI Database Browser. This file contains 250,250 structures as of August 2000 (one is missing for technical reasons). 3D coordinates have been calculated by Corina 3.0 and are available for 248,574 structures. The following properties are included:
- NSC Number
- Molecular weight
- Name (ACD)
- Formula
- CAS Registry Number
- SMILES string
- KOW logP
- Experimental logP
- ACD logP
- Drug Likeness (std)
- Drug Likeness (neg)
250,250 structures in SDF format. This is a 145 MB gzipped file that uncompresses to about 1005 MB.
Release 1 Files - October 1999
"0D"
The "raw" structure data that were used to build the previous version of the Enhanced NCI Database Browser, plus about 2,900 new structures. These are 249,081 "0D" structures (i.e. all coordinates set to 0.0) as of October 1999 in SDF format, in one file compressed with the widely available program gzip.
249,081 structures in SDF format. This is a 16.5 MB gzipped file that uncompresses to about 380 MB.
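When several of these files are handled together, it can be useful to check programmatically whether a record is "0D" in the sense described above, i.e. whether all atom coordinates are 0.0. The following is a minimal sketch that assumes V2000 molfile blocks, where the fourth line holds the atom count in its first three columns and each atom line stores x, y and z in three 10-character columns; the function name `is_zero_d` is hypothetical.

```python
def is_zero_d(molblock: str) -> bool:
    """Return True if every atom in a V2000 molfile block sits at (0, 0, 0)."""
    lines = molblock.splitlines()
    # Line 4 of a V2000 molfile is the counts line; columns 1-3 hold the atom count.
    n_atoms = int(lines[3][0:3])
    for atom_line in lines[4:4 + n_atoms]:
        # x, y and z occupy columns 1-10, 11-20 and 21-30 of each atom line.
        x, y, z = (float(atom_line[k:k + 10]) for k in (0, 10, 20))
        if (x, y, z) != (0.0, 0.0, 0.0):
            return False
    # Note: a block without atoms is also reported as "0D".
    return True
```

The same check could be used to separate records that fell back to 2D or 0D coordinates from those with calculated 3D coordinates, although a 2D structure would additionally require testing only the z column.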
Download
SMILES
A SMILES version of the structures (i.e. the above "0D" dataset) that were used to build this service, plus about 2,900 new structures. These are 249,081 structures as of October 1999 in SMILES format, in one file compressed with the widely available program gzip. SMILES strings were generated with the help of CACTVS. (This is a newly generated dataset and therefore not guaranteed to contain SMILES strings identical, for each compound, with those in previous SMILES string files, such as downloadable data from DTP.)
249,081 structures in SMILES format. This is a 3.2 MB gzipped file that uncompresses to about 18.5 MB.
Download
2D
2D version of NCI Open Database compounds as of October 1999. 2D coordinates (essentially structure drawings) calculated with CACTVS. Attention: Stereochemistry assigned by CACTVS according to default rules due to lack of stereochemical information in the original NCI data. (See also the 3D section.)
249,081 structures in SDF format. This is a 40 MB gzipped file that uncompresses to about 527 MB.
Download
2D + Biological Data
2D versions of NCI Open Database compounds as of October 1999, with biological test data added. These data are publicly available from the DTP Human Tumor Cell Line Screen and/or the DTP AIDS Antiviral Screen. 2D coordinates (essentially structure drawings) calculated with CACTVS. Attention: Stereochemistry assigned by CACTVS according to default rules due to lack of stereochemical information in the original NCI data. (See also the 3D section.)
249,081 structures in SDF format. Cancer data are as of August 1999, AIDS data and structures are as of October 1999. This is a 56 MB gzipped file that uncompresses to about 723 MB.
Download
32,577 structures with cancer test data in SDF format. Cancer data are as of August 1999. This is a 20 MB gzipped file that uncompresses to about 273 MB.
Download
42,689 structures in SDF format. AIDS test data are as of October 1999. This is a 9.1 MB gzipped file that uncompresses to about 114 MB.
Download
23,031 structures in SDF format for which both cancer and AIDS data are available. Cancer data are as of August 1999, AIDS data and structures are as of October 1999. This is a 13.5 MB gzipped file that uncompresses to about 195 MB.
Download
3D
A 3D version of the 0D file, containing 249,071 structures as of October 1999. The program CORINA v. 1.7 was used to generate the 3D coordinates. Please note that, just as with the 3D results provided by the Enhanced NCI Database Browser, the stereochemistry of chiral compounds is not guaranteed to be correct due to the lack of stereochemical information in the original data. This is not a shortcoming of CORINA. Please also note that, as of now, the 3D structures in this bulk file were not generated with the same version of CORINA as is used in the Browser, the latter being somewhat newer. This file is the result of a one-time conversion; no effort has been made to compare the conformations in it with those you obtain from the Browser (although we don't necessarily expect huge differences).
249,071 structures in SDF format. This is a 127 MB gzipped file that uncompresses to about 574 MB.
Download
For more information on all the files in this release, please see the Technical Notes.