AtomQM is a dataset of compounds with atom-level and bond-level quantum mechanical descriptors, intended for applications such as property prediction and model benchmarking. It currently contains 257,986 industrially relevant compounds taken from patent data, all of them small molecules with a mass below 500 u and composed of the elements C, H, N, O, P, S, F, Cl, Na, Mg and Ca.
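The selection criteria above (allowed elements, mass below 500 u) can be illustrated with a minimal sketch. This is not the pipeline's actual filter; the function name `passes_filter` and the use of rounded standard atomic weights are assumptions made for illustration.

```python
# Rounded standard atomic weights in u for the elements listed above.
ATOMIC_MASS = {
    "H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "F": 18.998,
    "Na": 22.990, "Mg": 24.305, "P": 30.974, "S": 32.06, "Cl": 35.45,
    "Ca": 40.078,
}
MAX_MASS = 500.0  # upper mass limit in u, as stated above


def passes_filter(symbols):
    """Return True if every element is allowed and the total mass is below 500 u.

    `symbols` is a list of element symbols, one per atom (hypothetical input format).
    """
    if any(s not in ATOMIC_MASS for s in symbols):
        return False
    return sum(ATOMIC_MASS[s] for s in symbols) < MAX_MASS


water = ["O", "H", "H"]
bromomethane = ["C", "H", "H", "H", "Br"]  # Br is not in the allowed element set
print(passes_filter(water))         # True
print(passes_filter(bromomethane))  # False
```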
The dataset was generated using a heavily parallelized automated DFT pipeline developed by M. Nouman as part of his thesis project (Jensen group, MIT). It remains under active development with a primary focus on improving data quality.
Generation Procedure
Because molecular quantum mechanical calculations are computationally expensive, the dataset uses the optimized molecular geometries from the PubChemQC dataset by Nakata et al. [1,2] as the starting point for calculations at a higher level of theory. Relevant compounds were selected by matching the molecules available in PubChemQC against reaction information from a heavily sanitized patent database. An overview of the data generation process is given below.

After relevant compounds are selected in the “subset matching” phase, their geometries are used to compute molecule-, atom- and bond-level descriptors which are then parsed, formatted and sanitized in the “output parsing” step. A thorough description of the pipeline is given in the author’s master’s thesis available here.
All computations were performed on the Supercloud high-performance computing cluster [3] at the B3LYP-D3BJ/def2-TZVPP level of theory [4,5], as implemented in ORCA 5.0.4 [6] with the NBO 7.0.10 extension [7].
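As a rough illustration of this level of theory, the sketch below builds a single-point ORCA input for one geometry. The pipeline's real input files are not shown here, so the keyword line, the helper name `orca_input` and the example geometry are assumptions. The `NBO` keyword also requires a separately installed NBO program.

```python
def orca_input(symbols, coords, charge=0, multiplicity=1):
    """Build a single-point ORCA input string (hypothetical helper).

    The keyword line is an assumption about how B3LYP-D3BJ/def2-TZVPP with
    NBO analysis could be requested; it is not copied from the pipeline.
    """
    lines = ["! B3LYP D3BJ def2-TZVPP NBO", f"* xyz {charge} {multiplicity}"]
    for sym, (x, y, z) in zip(symbols, coords):
        lines.append(f"{sym:<2} {x:12.6f} {y:12.6f} {z:12.6f}")
    lines.append("*")  # closes the coordinate block
    return "\n".join(lines) + "\n"


# Approximate water geometry in angstrom, for illustration only.
text = orca_input(
    ["O", "H", "H"],
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.96), (0.93, 0.0, -0.24)],
)
print(text)
```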
While the pipeline is flexible enough to produce a variety of output formats, the website provides only the .json files, which can be converted to suit a wide range of applications. For access to graph representations, which are particularly useful for training and benchmarking graph neural networks (GNNs), please contact the author directly.
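To give an idea of what such a conversion might look like, the sketch below turns one JSON record into a simple node and edge list. The key names (`atom_types`, `natural_charges`, `bond_indices`, `bond_lengths`) and all numeric values are placeholders invented for illustration; the published files may use a different schema. Bond indices are assumed to be stored in COO form, that is, as two parallel lists of row and column indices.

```python
import json

# Hypothetical record: key names and values are placeholders, not the real schema.
record_text = """
{
  "atom_types": ["O", "H", "H"],
  "natural_charges": [-0.92, 0.46, 0.46],
  "bond_indices": [[0, 0], [1, 2]],
  "bond_lengths": [0.96, 0.96]
}
"""


def to_graph(record):
    """Return (nodes, edges), storing each undirected bond in both directions."""
    nodes = list(zip(record["atom_types"], record["natural_charges"]))
    rows, cols = record["bond_indices"]  # COO: parallel row and column index lists
    edges = []
    for i, j, length in zip(rows, cols, record["bond_lengths"]):
        edges.append((i, j, length))
        edges.append((j, i, length))
    return nodes, edges


nodes, edges = to_graph(json.loads(record_text))
print(len(nodes), len(edges))  # 3 4
```

Storing both directions of each bond is a common convention for message-passing GNN libraries, but whether it is needed depends on the framework used.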
Sample Plots
As a brief overview of the AtomQM dataset, the plots below show summary statistics for a few of the available atom-level and bond-level features.
Atom-level
Bond-level
Included Descriptors
- Atom coordinates
- Atom types
- Atomic masses
- Magnetic dipoles
- Natural charges
- Hirshfeld charges [8]
- Natural orbitals
- Orbital occupancy
- Chemically active space (CAS)*
- SCF energies
- Bond indices (COO format)
- Bond energies
- Bond lengths
*For simplicity, the chemically active space is defined here as the 4 highest occupied orbitals and the 4 lowest unoccupied orbitals. Natural orbitals and natural population analysis are employed as described by Weinhold et al. [9,10]
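The definition of the chemically active space can be sketched as follows. This is a minimal illustration, not the pipeline's code: it assumes a closed-shell case with one energy and one occupancy per orbital, and the occupancy threshold of 1.0 used to separate occupied from unoccupied orbitals is an assumption. The function name and the example values are hypothetical.

```python
def chemically_active_space(energies, occupancies, n=4, threshold=1.0):
    """Return indices of the n highest occupied and n lowest unoccupied orbitals.

    Orbitals with occupancy >= threshold count as occupied (assumed criterion).
    """
    order = sorted(range(len(energies)), key=lambda k: energies[k])
    occupied = [k for k in order if occupancies[k] >= threshold]
    unoccupied = [k for k in order if occupancies[k] < threshold]
    return occupied[-n:], unoccupied[:n]


# Placeholder orbital energies and occupancies, for illustration only.
energies = [-10.0, -5.0, -3.0, -2.0, -1.0, -0.5, 0.1, 0.2, 0.3, 0.5, 0.9, 1.5]
occupancies = [2.0] * 6 + [0.0] * 6
highest_occ, lowest_unocc = chemically_active_space(energies, occupancies)
print(highest_occ)   # [2, 3, 4, 5]
print(lowest_unocc)  # [6, 7, 8, 9]
```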
References
1. Nakata, M.; Shimazaki, T. Journal of Chemical Information and Modeling 2017, 57, 1300–1308.
2. Nakata, M.; Shimazaki, T.; Hashimoto, M.; Maeda, T. Journal of Chemical Information and Modeling 2020, 60, 5891–5899.
3. Reuther, A. et al. In 2018 IEEE High Performance extreme Computing Conference (HPEC), 2018, pp 1–6.
4. Becke, A. D. The Journal of Chemical Physics 1993, 98, 5648–5652.
5. Grimme, S.; Ehrlich, S.; Goerigk, L. Journal of Computational Chemistry 2011, 32, 1456–1465.
6. Neese, F. WIREs Computational Molecular Science 2022, 12, e1606.
7. Glendening, E. D.; Landis, C. R.; Weinhold, F. In Complementary Bonding Analysis, 2021, pp 129–156.
8. Hirshfeld, F. L. Theoretica Chimica Acta 1977, 44, 129–138.
9. Weinhold, F. Journal of Computational Chemistry 2012, 33, 2363–2379.
10. Reed, A. E.; Weinstock, R. B.; Weinhold, F. The Journal of Chemical Physics 1985, 83, 735–746.