Today, GenomeWeb wrote a piece on the STORMSeq pipeline (Scalable Tools for Open source Read Mapping), our newest project toward the goal of enabling the public to explore their own personal genetic data. In this pipeline, users upload reads to Amazon S3 and start a webserver in Amazon EC2, where they can set parameters for read mapping and variant calling, all in a graphical user interface. Once they click “GO!”, the pipeline runs, progress and quality control metrics can be monitored, and the final results are uploaded back to Amazon S3. The pipeline itself is free, though the user pays for storage (currently $0.10 per GB-month) and compute time (currently estimated at about $1-2 per exome, $25-35 per genome) on the Amazon cloud. A publication with details about the pipeline is forthcoming, but the pipeline is ready to use now (currently in version 0.8.5) with instructions for use at www.stormseq.org (and the code is available on github.com/konradjk/stormseq).
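To make the pricing concrete, here is a rough back-of-the-envelope sketch in Python. It only combines the storage price and compute ranges quoted above; the function name, the 10 GB file size, and the two-month retention period are assumptions made up for illustration, and actual Amazon charges will vary.

```python
# Rough cost sketch using the prices quoted in the post; all inputs are illustrative.
STORAGE_USD_PER_GB_MONTH = 0.10                                # quoted storage price
COMPUTE_USD = {"exome": (1.0, 2.0), "genome": (25.0, 35.0)}    # quoted compute ranges

def estimate_cost(kind, data_gb, months):
    """Return a (low, high) estimate in USD for one sample."""
    storage = STORAGE_USD_PER_GB_MONTH * data_gb * months
    low, high = COMPUTE_USD[kind]
    return (round(storage + low, 2), round(storage + high, 2))

# Hypothetical example: a 10 GB exome kept on S3 for 2 months.
print(estimate_cost("exome", 10, 2))
```

For small exomes the compute charge dominates, while for data kept on S3 for many months the storage term is the one to watch.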
We’ve been putting the finishing touches on Exploring Personal Genomics (available for pre-order on Amazon!), but I thought I’d share the wordle for the entire book:
This past week, I attended the 2012 Pacific Symposium on Biocomputing, a week of science in the sun. For me, the week started with the “Systems Pharmacogenomics: Bridging the Gap” workshop. The session started with the systems theme from Trey Ideker, discussing epistatic interactions in yeast and how to apply the results to GWAS and eQTLs, bringing up challenges in the complexity of the analysis. Fortunately, model organisms are a bit easier to work with, and generating all pairwise combinations of, say, gene knockouts in yeast is a difficult but not unreasonable task. Pankaj Agrawal brought a perspective from industry and how systems are being used at GSK. He described the use of high-throughput analysis, like the Connectivity Map and GWAS data, for pharmaceutical purposes like drug repurposing: when a GWAS hit lies in a pharmacogene (for a different phenotype than the drug’s indication), it seems like a good time to think about the drug for other purposes.
While the first talks were highly optimistic, the reality became slightly more apparent in Stephane Bourgeois’s talk on the International Warfarin Pharmacogenomic Consortium (IWPC)’s meta-analysis of warfarin data. Warfarin is a poster-child of genomic personalized medicine, as ~55% of variance in stable warfarin dose can be explained by clinical and genetic factors. While this would be marked as a success by most people studying complex traits, it suffers the same problems when trying to explain the remainder of the variance. Bourgeois presented a large meta-analysis that was only able to explain incrementally more variance, despite increases in power. Here, there is a need to be more clever and integrate systems approaches (like the epistasis ones Trey Ideker talked about) to generate results of clinical significance (not to mention the burden is greater: getting a drug label approved by the FDA is slightly more difficult than getting a paper in NEJM). Russ Altman talked about using very different sources of data, including molecular data, text mining, expression data, and adverse event reports, for pharmacogenomic inference. Afterwards, the discussion brought up some new challenges. In single-celled organisms, the consequence of a gene-gene interaction may be apparent, but when it comes to systems pharmacogenomics, the presence of multiple different cell types in multiple different organs complicates matters further. This will require new methods, such as rapid mutagenesis to simulate the effect of rare variants.
The evening got interesting when a debate on the ethics of informed consent and return of genetic results to patients was held. The premise involved a resolution where patients would be sequenced and their genetic results linked to their EMR and shared with researchers. Greg Hampikian and Eric Meslin debated the merits and downsides of this resolution, asking the audience to move to two sides of the room depending on their stance at the time. The concept of patient “dignity” was discussed, as well as data privacy and openness and who should be allowed access to the data. The notion of “genetic exceptionalism” was challenged and scientists’ motives were questioned. The concept of an “opt-out” system was discussed, where proponents did not share the fears that the opponents did, while the opponents lamented the lack of proper informed consent. The debate was heated, to say the least.
The next day, Elaine Mardis’s keynote talk brought a number of applications of computational methods to clinical outcomes. Methods to detect variation in heterogeneous samples, as well as sequencing followed by re-sequencing, were used for characterizing relapse. The rest of the day involved text-mining approaches to pharmacogenomics, extracting drug-gene and drug-drug relationships from mining pharmacogenomic literature, as well as a followup to the Systems Pharmacogenomics workshop, which was quite practical, discussing issues of applying results from a model organism to humans (how to get from a pathway in one species to the homologous pathway in another) and using electronic medical records for biomedical validation.
At the discussion following the Personalized Medicine session (at which I presented the Interpretome platform), the privacy discussion resurfaced. We discussed issues of using genetic information in courses such as Stanford’s course in Personalized Medicine, as well as individuals’ reactions to obtaining genetic information and whether individuals and patients experience anxiety or change behaviors in response to the information.
All in all, the week was an interesting look into the interplay of complex mathematical modeling, biological discovery, and the practical and ethical issues involved (not to mention a great week in Hawaii with a great group of scientists). It is a conference I would highly recommend, and I am looking forward to going back soon.
I’ve been spending most of my time these days thinking about our personal genomes and their implications for our daily lives. I’ve had time to reflect on this during the past few events I’ve attended (the Cold Spring Harbor Personal Genomes Conference, the Open Science Summit, and a BioCurious Advanced Personal Genomics workshop I helped teach) and while knee-deep in writing a chapter on ethics for “Exploring Personal Genomics” (see below).
The Personal Genomes conference at CSHL (which, incidentally, is joining the Pharmacogenomics conference next year to become “Personal Genomes & Medical Genomics”) featured exciting methods and state-of-the-art analyses for personal genomics, but also clinical implications, including stories of the use of genome sequencing in aiding clinical decision making. Many there argued that the time for clinical sequencing is now, but the issue was raised of who would interpret genomic data. There are many pathologists interpreting laboratory data for diagnosis, but would they want to (and could they be trained to) interpret genomes? Later, at the Open Science Summit, the current practical implications were discussed, along with the ethics and current limitations (the “hopes and hypes”), but the feeling was still optimistic. Then, on Wednesday, at the Biocurious hacker space, the “Advanced Personal Genomics” workshop that I helped teach provided a glimpse into genome interpretation for early adopters. The response at this event was surprisingly positive: even after discussing the limitations of personal genomics at length, these individuals were still curious and enthusiastic about learning more about their genomes.
From our end, our paper describing the Interpretome system is online now, to be presented at the Pacific Symposium on Biocomputing in January. The paper serves as a rigorous description of the platform, including examples of personal genomics analyses and the modular nature of the system. Furthermore, Joel Dudley and I have been working on a book entitled “Exploring Personal Genomics,” a guide to understanding and interpreting a personal genome, to be published in 2012 by Oxford University Press (anyone who would like to be notified when the book is released is invited to enter their email address here).
First, let me start off by saying thanks to everyone that has explored their data on Interpretome so far. We’ve had a tremendous response to the site and I couldn’t be more thrilled. I wanted to provide an update on certain perspectives I’ve gotten from scouring the web for reactions to the analyses on the site.
It seems that many have enjoyed the Ancestry and Neandertal analyses, and to be honest, these are some of my favorites too! They truly are a fascinating look into the role of ancient DNA and human migration patterns throughout history. Plotting yourself on a world/continent map can really give a perspective on where you’ve come from. My (not surprising) Polish ancestry jumps out on the POPRES dataset, clustering among Polish and Northern Europeans.
For those looking to explore their own, for most of the datasets, plotting PC1 vs. PC2 with any number of SNPs (the more, the better) should give good results, assuming you are somewhat similar to at least one of the populations in that reference panel. This means that Africans will likely find interesting results from the African PCA, but it is uninterpretable for Europeans. (As an aside, the POPRES dataset is best run with PC1 vs. PC4 and using the relevant platform, 43K for v2 and 74K for v3).
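For readers curious what “plotting yourself” on a reference panel involves, here is a minimal sketch in Python with NumPy. It is not Interpretome’s code: the function names are made up, the reference panel is synthetic, and genotypes are assumed to be coded as 0/1/2 copies of an allele. The idea is to fit principal components on the reference individuals only and then project a new individual onto whichever pair of components you want to plot.

```python
import numpy as np

def fit_pca(ref_genotypes):
    """Fit PCA on a reference panel (rows: individuals, columns: SNPs coded 0/1/2)."""
    mean = ref_genotypes.mean(axis=0)
    centered = ref_genotypes - mean
    # Rows of vt are the principal axes (SNP loadings), ordered by variance explained.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt

def project(genotype, mean, vt, pcs=(0, 1)):
    """Project one individual onto the chosen components (0-based: (0, 1) is PC1 vs. PC2)."""
    return (genotype - mean) @ vt[list(pcs)].T

# Synthetic panel: two hypothetical populations with different allele frequencies.
rng = np.random.default_rng(0)
freq_a = rng.uniform(0.1, 0.9, size=200)
freq_b = np.clip(freq_a + rng.normal(0, 0.3, size=200), 0.05, 0.95)
pop_a = rng.binomial(2, freq_a, size=(30, 200))
pop_b = rng.binomial(2, freq_b, size=(30, 200))
panel = np.vstack([pop_a, pop_b]).astype(float)

mean, vt = fit_pca(panel)
me = rng.binomial(2, freq_a, size=200).astype(float)  # a new individual drawn like population A
print(project(me, mean, vt))                # coordinates on PC1 vs. PC2
print(project(me, mean, vt, pcs=(0, 3)))    # coordinates on PC1 vs. PC4
```

Because the components are learned from the reference panel alone, an individual unlike every population in the panel has no meaningful place to land, which is why a European projected onto an African panel gives an uninterpretable result.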
For the Chromosome Painting, at present, there is no “right” set of parameters. We use a heuristic/approximation algorithm to determine the ancestry tracks and we are actively developing more robust methods (as well as adding more distinct, i.e. less admixed reference populations). The challenge is to provide an accurate tool that can be run in your own browser (without too much computing power or sophisticated custom software). At the moment, the Hapmap 2 painting should work reasonably well: tuning the parameters will affect the sensitivity, at the cost of some noise.
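To illustrate the sensitivity/noise trade-off, here is a toy windowed painting sketch in Python. To be clear, this is not the heuristic Interpretome uses (which is not described here); it is a generic stand-in that labels each SNP with whichever reference population best explains the alleles in a surrounding window. The population names, allele frequencies, and the `window` parameter are all made up for illustration.

```python
import math

def paint(alleles, freqs, window=5):
    """Label each SNP with the population whose allele frequencies best explain
    the surrounding window of alleles.

    alleles: list of 0/1 alleles along one haplotype
    freqs:   dict of population name -> frequency of allele 1 at each SNP
             (frequencies must be strictly between 0 and 1)
    """
    half = window // 2
    labels = []
    for i in range(len(alleles)):
        lo, hi = max(0, i - half), min(len(alleles), i + half + 1)
        best_pop, best_ll = None, -math.inf
        for pop, f in freqs.items():
            # Log-likelihood of the window under this population's frequencies.
            ll = sum(math.log(f[j] if alleles[j] else 1.0 - f[j]) for j in range(lo, hi))
            if ll > best_ll:
                best_pop, best_ll = pop, ll
        labels.append(best_pop)
    return labels

# Hypothetical frequencies for two made-up reference populations at 8 SNPs.
freqs = {"popA": [0.9] * 8, "popB": [0.1] * 8}
switching = [1, 1, 1, 1, 0, 0, 0, 0]   # ancestry changes halfway along
noisy = [1, 1, 1, 0, 1, 1, 1, 1]       # a single stray allele

print(paint(switching, freqs, window=3))
print(paint(noisy, freqs, window=1))   # small window: the stray allele gets its own label
print(paint(noisy, freqs, window=3))   # larger window: the stray allele is smoothed over
```

A small window reacts to every allele (more sensitive, but noisier), while a larger window smooths over isolated mismatches at the risk of missing short ancestry tracks.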
Also coming soon is a new method to illustrate disease risk analysis, grouping SNPs by disease to make them easier to visualize. While we don’t intend to provide any actual predictive analysis, our mission is to provide the tools needed for anyone to explore their genome. We hope that this will educate the public (including clinicians and scientists as well as hobbyists) about the power/potential power/limitations of a personal genome and enable individuals to do and share their own custom analysis.
In the race for the $1,000 genome, the issue of the $1,000,000 interpretation has not been forgotten. Combing through millions of variants in a personal genome has presented numerous challenges for all parties involved: the physician looking to add genomic measurements to inform their diagnoses, the patients trying to figure out what they should be worried about, and the hobbyists interested in what their DNA means to them. Direct-to-consumer genetic testing companies such as 23andme, Lumigenix, and Navigenics offer a glimpse into the interpretation of a genome. These companies curate literature on gene-trait associations and provide attractive user interfaces for navigating a personal genotype. However, the interpretations offered by these companies often differ, not because of inherent differences in technologies, but because of which variants they choose to consider in their calculations. While many have taken this fact to indicate a weakness of the genetic testing industry (and indeed, it is one that needs to be addressed), it is also a reflection of the dynamic nature of the field.
The site provides an open-source framework for personal genome interpretation, demonstrating the power of genotyping for ancestral and clinical analysis (though it should be noted that this service, like 23andme, should not be used for diagnostic purposes and is not approved by the FDA). We also feature an exploratory section, mirroring the lectures of the course, as well as other analyses of interest and an option to upload your own analyses. In here, you’ll find the fun Neandertal calculator, to calculate your number of alleles likely derived from Neandertal (according to Green et al.).
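At its core, a calculator like this comes down to counting alleles. The Python sketch below shows the general idea only; it is not the site’s implementation, the SNP ids and alleles are invented placeholders rather than markers from Green et al., and it assumes genotypes are two-letter strings with “--” marking a no-call.

```python
def count_derived_alleles(genotypes, derived):
    """Count copies of the allele of interest across the SNPs that were typed.

    genotypes: dict of SNP id -> two-letter genotype string, e.g. "AG"
    derived:   dict of SNP id -> the allele of interest at that SNP
    Returns (copies found, maximum possible at the typed SNPs).
    """
    copies, typed = 0, 0
    for snp, allele in derived.items():
        call = genotypes.get(snp)
        if call is None or "-" in call:   # skip untyped SNPs and no-calls
            continue
        typed += 1
        copies += call.count(allele)
    return copies, 2 * typed

# Made-up SNP ids and alleles, for illustration only.
derived = {"rs0000001": "A", "rs0000002": "T", "rs0000003": "G"}
genotypes = {"rs0000001": "AG", "rs0000002": "TT", "rs0000003": "--"}
print(count_derived_alleles(genotypes, derived))
```

Reporting the maximum alongside the count matters because different genotyping platforms type different SNPs, so raw counts are not directly comparable between chips.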
As new results like this one pop up, this open framework will allow users to customize their analyses based on their interest. Explore your genome at www.interpretome.com.
As some of you may know, for the better part of this year, I have been involved in the organization of BCATS 2010 (Biomedical Computation at Stanford), a one-day student-run conference on Stanford’s campus. Yesterday, I had the great privilege of being the chair of this interdisciplinary conference, which featured 14 talks, 10 spotlight/bullet talks, and close to 50 posters (abstract book). However, this conference was not your typical themed meeting. Rather than a conference where everyone in attendance was an expert in the specific narrow field of the conference, BCATS attendees all spoke the same language of computational and statistical analysis, but applied these methods to very different problems across biology and medicine.
The day started with Zemin Zhang from Genentech, whom we were very grateful to have deliver the first keynote address, on his work in genomics and computational biology. Dr. Zhang brought a perspective on the power of genome sequencing in understanding the complex biological basis of cancer. From there, the first group of student talks focused on computational approaches to the study of systems biology, from analysis of transcriptomics data to an integrative model of a whole cell. The second session focused on analysis of existing datasets to learn about drugs, their effects, and their interactions with other drugs. This was a particularly interesting session, as the speakers presented work that applied mathematical and statistical methods to a topic that everyone could understand without too much technical knowledge: clinical use of drugs, including things that can go right and things that can go wrong.
In the afternoon, BJ Fregly from University of Florida provided a fascinating look into personalized therapy for osteoarthritis through simulation of arthritis development, rehabilitation treatment, and forces on knee joints and muscles. In the keynote and the session that followed, the talks stressed the importance of biological models for developing our understanding of various biomechanical and biochemical processes. The final session’s talks brought together multiple sources of data for representation and interpretation of clinical data, reminding us that direct application to a clinical setting is never too far off.
All in all, the conference brought together researchers from across the fields of biocomputation in a unified setting. While the experience taught me more than I’d like to know about the logistics of organizing such an event (along with everything that can go wrong at the last minute), as soon as I sat down and started listening to the first talks, I was reminded of the quality of the science that really goes on here at Stanford. On a more personal note, I would like to thank my co-organizers, Rob Tirrell, Jessica Faruque, Amir Ghazvinian, Matt Demers, and Keyan Salari for all their help throughout, as well as our sponsors and volunteers for their support. It was a great day and I look forward to next year’s conference.