Today, GenomeWeb wrote a piece on the STORMSeq pipeline (Scalable Tools for Open source Read Mapping), our newest project toward the goal of enabling the public to explore their own personal genetic data. In this pipeline, users upload reads to Amazon S3 and start a webserver in Amazon EC2, where they can set parameters for read mapping and variant calling, all in a graphical user interface. Once they click “GO!”, the pipeline runs, progress and quality control metrics can be monitored, and the final results are uploaded back to Amazon S3. The pipeline itself is free, though the user pays for storage (currently $0.10 per GB-month) and compute time (currently estimated at about $1-2 per exome, $25-35 per genome) on the Amazon cloud. A publication with details about the pipeline is forthcoming, but the pipeline is ready to use now (currently in version 0.8.5) with instructions for use at www.stormseq.org (and the code is available on github.com/konradjk/stormseq).
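For a rough sense of what a run might cost, here is a back-of-the-envelope sketch in Python using only the prices quoted above. The function name and the example data size (10 GB stored for one month) are made up for illustration, and Amazon's actual pricing may differ from these figures.

```python
# Prices quoted in the post; treat them as a snapshot, not current AWS pricing.
STORAGE_USD_PER_GB_MONTH = 0.10
COMPUTE_USD = {"exome": (1.0, 2.0), "genome": (25.0, 35.0)}


def estimate_cost(sample_type, data_gb, months=1.0):
    """Return a (low, high) cost estimate in USD for one sample.

    sample_type: "exome" or "genome"
    data_gb:     size of the stored data in GB (an assumption you supply)
    months:      how long the data stays in storage
    """
    low, high = COMPUTE_USD[sample_type]
    storage = STORAGE_USD_PER_GB_MONTH * data_gb * months
    return (low + storage, high + storage)


if __name__ == "__main__":
    # Hypothetical example: an exome with 10 GB of data kept for one month.
    low, high = estimate_cost("exome", data_gb=10, months=1)
    print(f"Estimated cost: ${low:.2f} to ${high:.2f}")
```

Storage scales with both data size and time, so deleting intermediate files once the results are downloaded keeps the bill close to the compute cost alone.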
Author Archive for Konrad Karczewski
We’ve been putting the finishing touches on Exploring Personal Genomics (available for pre-order on Amazon!), but I thought I’d share the wordle for the entire book:
This past week, I attended the 2012 Pacific Symposium on Biocomputing, a week of science in the sun. For me, the week started with the “Systems Pharmacogenomics: Bridging the Gap” workshop. The session opened with the systems theme from Trey Ideker, who discussed epistatic interactions in yeast and how to apply the results to GWAS and eQTLs, bringing up challenges in the complexity of the analysis. Fortunately, model organisms are a bit easier to work with, and generating all pairwise combinations of, say, gene knockouts in yeast, is a difficult, but not unreasonable task. Pankaj Agrawal brought a perspective from industry and how systems are being used at GSK. He described the use of high-throughput analysis, like the Connectivity Map and GWAS data, for pharmaceutical purposes like drug repurposing: when a GWAS hit lies in a pharmacogene (for a different phenotype than the drug’s indication), it seems like a good time to think about the drug for other purposes.
While the first talks were highly optimistic, the reality became slightly more apparent in Stephane Bourgeois’s talk on the International Warfarin Pharmacogenetics Consortium (IWPC)’s meta-analysis of warfarin data. Warfarin is a poster-child of genomic personalized medicine, as ~55% of variance in stable warfarin dose can be explained by clinical and genetic factors. While this would be marked as a success by most people studying complex traits, it suffers the same problems when trying to explain the remainder of the variance. Bourgeois presented a large meta-analysis that was only able to explain incrementally more variance, despite increases in power. Here, there is a need to be more clever and integrate systems approaches (like the epistasis ones Trey Ideker talked about) to generate results of clinical significance (not to mention the burden is greater: getting a drug label approved by the FDA is slightly more difficult than getting a paper in NEJM). Russ Altman talked about using very different sources of data, including molecular data, text mining, expression data, and adverse event reports, for pharmacogenomic inference. Afterwards, the discussion brought up some new challenges. In single-celled organisms, the consequence of a gene-gene interaction may be apparent, but when it comes to systems pharmacogenomics, the presence of multiple different cell types in multiple different organs complicates matters further. This will require new methods, such as rapid mutagenesis to simulate the effect of rare variants.
The evening got interesting with a debate on the ethics of informed consent and the return of genetic results to patients. The premise involved a resolution where patients would be sequenced and their genetic results linked to their EMR and shared with researchers. Greg Hampikian and Eric Meslin debated the merits and downsides of this resolution, asking the audience to move to two sides of the room depending on their stance at the time. The concept of patient “dignity” was discussed, as well as data privacy and openness and who should be allowed access to the data. The notion of “genetic exceptionalism” was challenged and scientists’ motives were questioned. The concept of an “opt-out” system was discussed, where proponents did not share the fears that the opponents did, while the opponents lamented the lack of proper informed consent. The debate was heated, to say the least.
The next day, Elaine Mardis’s keynote talk brought a number of applications of computational methods to clinical outcomes. Methods to detect variation in heterogeneous samples, as well as sequencing followed by re-sequencing, were used for characterizing relapse. The rest of the day involved text-mining approaches to pharmacogenomics, extracting drug-gene and drug-drug relationships from the pharmacogenomic literature, as well as a follow-up to the Systems Pharmacogenomics workshop, which was quite practical, discussing issues of applying results from a model organism to humans (how to get from a pathway in one species to the homologous pathway in another) and using electronic medical records for biomedical validation.
At the discussion following the Personalized Medicine session (at which I presented the Interpretome platform), the privacy discussion resurfaced. We discussed issues of using genetic information in courses such as Stanford’s course in Personalized Medicine, as well as individuals’ reactions to obtaining genetic information and whether individuals and patients experience anxiety or change behaviors in response to the information.
All in all, the week was an interesting look into the interplay of complex mathematical modeling, biological discovery, and the practical and ethical issues involved (not to mention a great week in Hawaii with a great group of scientists). It is a conference I would highly recommend and am looking forward to going back soon.
I’ve been spending most of my time these days thinking about our personal genomes and their implications for our daily lives. I’ve had time to reflect on this during the past few events I’ve attended (the Cold Spring Harbor Personal Genomes Conference, the Open Science Summit, and a BioCurious Advanced Personal Genomics workshop I helped teach) and while knee-deep in writing a chapter on ethics for “Exploring Personal Genomics” (see below).
The Personal Genomes conference at CSHL (which, incidentally, is joining the Pharmacogenomics conference next year to become “Personal Genomes & Medical Genomics”) featured exciting methods and state-of-the-art analyses for personal genomics, but also clinical implications, including stories of the use of genome sequencing in aiding clinical decision making. Many there argued that the time for clinical sequencing is now, but the issue was raised of who would interpret genomic data. There are many pathologists interpreting laboratory data for diagnosis, but would they want to (and could they be trained to) interpret genomes? Later, at the Open Science Summit, the current practical implications were discussed, along with the ethics and current limitations (the “hopes and hypes”), but the feeling was still optimistic. Then, on Wednesday, at the BioCurious hacker space, the “Advanced Personal Genomics” workshop that I helped teach provided a glimpse into genome interpretation for early adopters. The response at this event was surprisingly positive: even after discussing the limitations of personal genomics at length, these individuals were still curious and enthusiastic about learning more about their genomes.
From our end, our paper describing the Interpretome system is online now, to be presented at the Pacific Symposium on Biocomputing in January. The paper serves as a rigorous description of the platform, including examples of personal genomics analyses and the modular nature of the system. Furthermore, Joel Dudley and I have been working on a book entitled “Exploring Personal Genomics,” a guide to understanding and interpreting a personal genome, to be published in 2012 by Oxford University Press (anyone who would like to be notified when the book is released is invited to enter their email address here).
Yesterday, a paper on the analysis and interpretation of the genomes of a family of four was released in PLoS Genetics and featured in the Wall Street Journal, spearheaded by Rick Dewey and Euan Ashley. I was fortunate to be involved in this groundbreaking analysis, a logical next step to the clinical interpretation of Steve Quake’s genome last year in the Lancet. Collaborating on this paper got me thinking about analysis of family genomes in the age of GWAS (Genome-Wide Association Studies).
In the linkage studies of the past, researchers focused on families and segregation patterns of alleles to identify genes significantly linked with disease. These studies worked well for rare diseases, as they could focus on a single linked region/gene at a time. But for complex/multigenic diseases, the segregation patterns of a disease are not as clear, and the GWAS community has stepped in to tackle these problems on a larger scale. However, the genetic basis of only a few diseases, such as age-related macular degeneration, has been successfully mapped by GWAS (to, say, greater than 50% of the genetic variance explained by the factors identified in the studies), and the bulk of diseases and traits have come up short. For complex diseases, the difficulty is the same as before: with so many unaccounted-for variables, we are back to a needle in a haystack problem. There is great potential for combining family data with GWAS-based methods: in a method analogous to Sarah Ng and Jay Shendure’s identification of disease genes in rare diseases by exome sequencing, the ability to “subtract out” some of the noise (that may be family-specific) may yield more reliable results. Specifically, an unaffected family member may be used to down-weight the SNPs in common with an affected subject.
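To make the down-weighting idea concrete, here is a minimal Python sketch. It is only an illustration of the general idea, not the method from the paper or from any published tool: the function name, the SNP identifiers, the per-SNP scores, and the penalty factor of 0.5 are all hypothetical.

```python
def downweight_shared(affected, unaffected, base_scores, penalty=0.5):
    """Down-weight SNPs whose genotype is shared with an unaffected relative.

    affected, unaffected: dicts mapping SNP id -> genotype string (e.g. "AG")
    base_scores:          dict mapping SNP id -> prior score for the affected subject
    penalty:              multiplier applied to shared SNPs (hypothetical value)
    """
    weighted = {}
    for snp, score in base_scores.items():
        # A SNP counts as "shared" only if both family members were genotyped
        # at that site and carry the same genotype.
        shared = snp in affected and affected[snp] == unaffected.get(snp)
        weighted[snp] = score * penalty if shared else score
    return weighted


if __name__ == "__main__":
    # Made-up genotypes and scores, purely for illustration.
    affected = {"snp1": "AG", "snp2": "TT", "snp3": "CC"}
    unaffected = {"snp1": "AG", "snp2": "CT"}
    scores = {"snp1": 2.0, "snp2": 3.0, "snp3": 1.0}
    print(downweight_shared(affected, unaffected, scores))
```

In this toy example only snp1 is shared with the unaffected relative, so only its score is halved. A real analysis would need to account for penetrance, relatedness, and genotyping error rather than using a flat penalty.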
Looking at the genomes of the whole family at once in a clinical assessment context (applying results from large studies to a smaller number of individuals) was crucial to this analysis. At the most basic level, simply estimating the error rates is greatly aided by the sequencing of multiple family members: knowing that the likelihood that AG and GG parents will have an AA child is vanishingly small gives us a confidence level for the SNP calls we do make. Then, when it comes to assessment of disease risk, analysis of multiple family members demonstrates the exact problem of complex diseases. While neither parent may be at risk for a disease, the exact combination of alleles passed down can confer a greater risk than the average of the parents. It is precisely here that genetic risk has the potential to trump family history in clinical analysis. At present, family history is a great predictor of clinical outcome, as it encapsulates much of the uncharacterized risk conferred by genetics. However, as our understanding of the genetic factors of disease increases, the genetic profile can incorporate something the family history cannot: the precise pattern of allele segregation. Finally, a family analysis can allow for phased genomes, which can reveal the presence of “compound heterozygotes,” or cases where the two alleles of a gene are affected by 2 different SNPs. While each of these may not be damaging on its own, the combination may render both copies of the gene ineffective.
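The AG/GG/AA example can be turned into a simple consistency check. The Python sketch below is a simplified illustration, not the error model used in the paper: it assumes biallelic autosomal SNPs with unphased genotypes written as two-letter strings, and it treats every Mendelian inconsistency as a likely calling error (true de novo mutations are rare enough to ignore in a rough estimate). The function names are my own.

```python
from itertools import product


def possible_child_genotypes(mother, father):
    """All unphased genotypes a child can inherit: one allele from each parent."""
    return {"".join(sorted(m + f)) for m, f in product(mother, father)}


def is_mendelian_consistent(mother, father, child):
    """True if the child's genotype can be explained by the parents' genotypes."""
    return "".join(sorted(child)) in possible_child_genotypes(mother, father)


def mendelian_error_rate(trios):
    """Fraction of (mother, father, child) genotype triples that are inconsistent."""
    trios = list(trios)
    if not trios:
        return 0.0
    errors = sum(not is_mendelian_consistent(m, f, c) for m, f, c in trios)
    return errors / len(trios)


if __name__ == "__main__":
    # AG and GG parents cannot have an AA child without an error or a new mutation.
    print(is_mendelian_consistent("AG", "GG", "AA"))
    print(is_mendelian_consistent("AG", "GG", "AG"))
```

Running `mendelian_error_rate` over all sites called in a trio gives a lower bound on the genotyping error rate, since some errors still produce genotypes that look consistent with the parents.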
As the availability of genome-wide methods rapidly expanded, analysis of families seemed to go out of fashion for a while. Of course, we will need sophisticated informatics methods to tease out the signal from the noise, and these would not be trivial. However, with the current trends of the cost of genotyping and genome sequencing, a dataset of 100 families with a common disease is not out of sight. Then, of course, the clinical assessment of a family genome is another challenge, to which this paper brings a novel perspective, and it will be fascinating to follow the further development of these methods.
First, let me start off by saying thanks to everyone that has explored their data on Interpretome so far. We’ve had a tremendous response to the site and I couldn’t be more thrilled. I wanted to provide an update on certain perspectives I’ve gotten from scouring the web for reactions to the analyses on the site.
It seems that many have enjoyed the Ancestry and Neandertal analyses, and to be honest, these are some of my favorites too! They truly are a fascinating look into the role of ancient DNA and human migration patterns throughout history. Plotting yourself on a world/continent map can really give a perspective on where you’ve come from. My (not surprising) Polish ancestry jumps out on the POPRES dataset, clustering among Polish and Northern Europeans.
For those looking to explore their own, for most of the datasets, plotting PC1 vs. PC2 with any number of SNPs (the more, the better) should give good results, assuming you are somewhat similar to at least one of the populations in that reference panel. This means that Africans will likely find interesting results from the African PCA, but it is uninterpretable for Europeans. (As an aside, the POPRES dataset is best run with PC1 vs. PC4 and using the relevant platform, 43K for v2 and 74K for v3).
For the Chromosome Painting, at present, there is no “right” set of parameters. We use a heuristic/approximation algorithm to determine the ancestry tracks and we are actively developing more robust methods (as well as adding more distinct, i.e. less admixed reference populations). The challenge is to provide an accurate tool that can be run in your own browser (without too much computing power or sophisticated custom software). At the moment, the Hapmap 2 painting should work reasonably well: tuning the parameters will affect the sensitivity, at the cost of some noise.
Also coming soon is a new method to illustrate disease risk analysis, grouping SNPs by disease to visualize them more easily. While we don’t intend to provide any actual predictive analysis, our mission is to provide the tools needed for anyone to explore their genome. We hope that this will educate the public (including clinicians and scientists as well as hobbyists) about the power/potential power/limitations of a personal genome and enable individuals to do and share their own custom analysis.
In the race for the $1,000 genome, the issue of the $1,000,000 interpretation has not been forgotten. Combing through millions of variants in a personal genome has presented numerous challenges for all parties involved: the physicians looking to add genomic measurements to inform their diagnoses, the patients trying to figure out what they should be worried about, and the hobbyists interested in what their DNA means to them. Direct-to-consumer genetic testing companies such as 23andMe, Lumigenix, and Navigenics offer a glimpse into the interpretation of a genome. These companies curate literature on gene-trait associations and provide attractive user interfaces for navigating a personal genotype. However, the interpretations offered by these companies often differ, not because of inherent differences in technologies, but because of which variants they choose to consider in their calculations. While many have taken this fact to indicate a weakness of the genetic testing industry (and indeed, it is one that needs to be addressed), it is also a reflection of the dynamic nature of the field.
The site provides an open-source framework for personal genome interpretation, demonstrating the power of genotyping for ancestral and clinical analysis (though it should be noted that this service, like 23andMe, should not be used for diagnostic purposes and is not approved by the FDA). We also feature an exploratory section, mirroring the lectures of the course, as well as other analyses of interest and an option to upload your own analyses. Here, you’ll find the fun Neandertal calculator, which calculates the number of your alleles likely derived from Neandertal (according to Green et al.).
As new results like this one pop up, this open framework will allow users to customize their analyses based on their interest. Explore your genome at www.interpretome.com.