Described here is ‘toffee’, an open file format for mass spectrometry data with lossless compression that gives file sizes similar to the original vendor format. mzML and toffee are shown to be equivalent when processing data using OpenSWATH algorithms, and toffee additionally enables novel applications through new data access patterns. For instance, a peptide-centric deep-learning pipeline for peptide identification is proposed. Specifically in the context of ProCan, toffee reduces our long-term storage costs from >$60k per month to around $6k per month, a critical development for the long-term sustainability of bio-bank scale proteomics.
The cancer tissue proteome has enormous potential as a source of novel predictive biomarkers in oncology. Progress in the development of mass spectrometry (MS)‐based tissue proteomics now presents an opportunity to exploit this by applying the strategies of comprehensive molecular profiling and big‐data analytics that are refined in other fields of ‘omics research. ProCan (ProCan is a registered trademark) is a program aiming to generate high‐quality tissue proteomic data across a broad spectrum of cancer types. It is based on data‐independent acquisition–MS proteomic analysis of annotated tissue samples sourced through collaboration with expert clinical and cancer research groups. The practical requirements of a high‐throughput translational research program have shaped the approach that ProCan is taking to address challenges in study design, sample preparation, raw data acquisition, and data analysis. The ultimate goal is to establish a large proteomics knowledge‐base that, in combination with other cancer ‘omics data, will accelerate cancer research.
In the current study, we show how ProCan90, a curated data set of HEK293 technical replicates, can be used to optimize the configuration options for algorithms in the OpenSWATH pipeline. Furthermore, we use this case study as a proof of concept for horizontal scaling of such a pipeline to allow 45 810 computational analysis runs of OpenSWATH to be completed within four and a half days on a budget of US $10 000. Through the use of Amazon Web Services (AWS), we have successfully processed each of the ProCan90 files with 506 combinations of input parameters. In total, the project consumed more than 340 000 core hours of compute and generated in excess of 26 TB of data. Using the resulting data and a set of quantitative metrics, we show an analysis pathway that allows the calculation of two optimal parameter sets: one for a compute-rich environment (where run time is not a constraint) and another for a compute-poor environment (where run time is optimized). For the same input files and the compute-rich parameter set, we show a 29.8% improvement in the number of quality protein (>2 peptide) identifications found compared with the current OpenSWATH defaults, with negligible adverse effects on quantification reproducibility or drop in identification confidence, and a median run time of 75 min (a 103% increase). For the compute-poor parameter set, we find a 55% improvement in run time relative to the default parameter set, at the expense of a 3.4% decrease in the number of quality protein identifications, while the intensity CV decreases from 14.0% to 13.7%.
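The scale of such a sweep can be made concrete with a short calculation. The following is a minimal sketch, not the authors' pipeline: it derives average per-run figures from the totals quoted above (treated as bounds, since the text says "more than" 340 000 core hours and gives US $10 000 as a budget), shows one way of crossing input files with a parameter grid, and computes a coefficient of variation as a reproducibility metric. The function names and the example parameter names (`rt_extraction_window`, `mz_extraction_window`) are hypothetical and chosen for illustration only. The sketch takes the quoted run total at face value rather than using the product of 90 files and 506 combinations (45 540).

```python
from itertools import product
from statistics import mean, stdev

# Totals quoted in the abstract; treated here as approximate bounds.
TOTAL_RUNS = 45_810
CORE_HOURS = 340_000   # "more than"
BUDGET_USD = 10_000    # stated budget
WALL_DAYS = 4.5


def sweep_summary(runs=TOTAL_RUNS, core_hours=CORE_HOURS,
                  budget=BUDGET_USD, days=WALL_DAYS):
    """Average per-run figures implied by the quoted totals."""
    wall_hours = days * 24
    return {
        "core_hours_per_run": core_hours / runs,
        "usd_per_run": budget / runs,
        "usd_per_core_hour": budget / core_hours,
        # Average number of cores busy at once to finish in the wall time.
        "mean_concurrent_cores": core_hours / wall_hours,
    }


def build_jobs(files, grid):
    """Cross every input file with every parameter combination."""
    names = list(grid)
    combos = [dict(zip(names, values)) for values in product(*grid.values())]
    return [(f, combo) for f in files for combo in combos]


def cv_percent(intensities):
    """Coefficient of variation (sample standard deviation / mean), in percent."""
    return 100.0 * stdev(intensities) / mean(intensities)


if __name__ == "__main__":
    for key, value in sweep_summary().items():
        print(f"{key}: {value:.4g}")

    # Hypothetical grid: parameter names and values are illustrative only.
    grid = {"rt_extraction_window": [300, 600],
            "mz_extraction_window": [25, 50, 75]}
    jobs = build_jobs(["a.mzML", "b.mzML", "c.mzML"], grid)
    print(len(jobs), "jobs")
```

Because each (file, parameter set) pair is independent, a job list of this kind can be distributed across as many workers as the budget allows, which is what makes horizontal scaling applicable to this problem.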
We have developed a streamlined proteomic sample preparation protocol termed Accelerated Barocycler Lysis and Extraction (ABLE) that substantially reduces the time and cost of tissue sample processing. ABLE is based on pressure cycling technology (PCT) for rapid tissue solubilization and reliable, controlled proteolytic digestion. Here, a previously reported PCT-based protocol was optimized using 1–4 mg biopsy punches from rat kidney. The tissue denaturant urea was substituted with a combination of sodium deoxycholate (SDC) and N-propanol. ABLE produced comparable numbers of protein identifications in half the sample preparation time, with samples ready for MS injection in 3 h compared with 6 h for the conventional urea-based method. To validate ABLE, it was applied to a diverse range of rat tissues (kidney, lung, muscle, brain, testis), human HEK 293 cell lines, and human ovarian cancer samples, followed by SWATH-mass spectrometry (SWATH-MS). ABLE-SWATH and the conventional method quantified similar numbers of proteins, with greater than 70% overlap for all sample types except muscle (58%). The ABLE protocol offers a standardized, high-throughput, efficient, and reproducible proteomic preparation method that, when coupled with SWATH-MS, has the potential to accelerate proteomics analysis to achieve a clinically relevant turn-around time.