Today, GenomeWeb wrote a piece on the STORMSeq pipeline (Scalable Tools for Open-source Read Mapping), our newest project in our ongoing goal of enabling the public to explore their own personal genetic data. In this pipeline, users upload reads to Amazon S3 and start a web server on Amazon EC2, where they can set parameters for read mapping and variant calling, all through a graphical user interface. Once they click “GO!”, the pipeline runs, progress and quality-control metrics can be monitored, and the final results are uploaded back to Amazon S3. The pipeline itself is free, though the user pays for storage (currently $0.10 per GB-month) and compute time (currently estimated at $1-2 per exome and $25-35 per genome) on the Amazon cloud. A publication with details about the pipeline is forthcoming, but the pipeline is ready to use now (currently version 0.8.5), with instructions for use at www.stormseq.org (and the code available at github.com/konradjk/stormseq).
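For the curious, the pricing above lends itself to a quick back-of-the-envelope calculation. The sketch below is purely illustrative (the function and cost table are mine, not an official STORMSeq calculator), using the $0.10 per GB-month storage rate and the per-sample compute estimates quoted above:

```python
# Illustrative cost model for a STORMSeq run, based on the prices quoted in
# the post. The ranges and function names are assumptions for illustration.

COMPUTE_COST = {"exome": (1.0, 2.0), "genome": (25.0, 35.0)}  # USD per sample
STORAGE_PER_GB_MONTH = 0.10  # S3 storage, USD per GB-month

def estimate_cost(kind, data_gb, months=1):
    """Return a (low, high) total cost estimate in USD for one sample."""
    lo, hi = COMPUTE_COST[kind]
    storage = data_gb * STORAGE_PER_GB_MONTH * months
    return (lo + storage, hi + storage)
```

So an exome with 10 GB of reads kept on S3 for a month would land somewhere around $2-3 total, dominated by storage rather than compute.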
We’ve been putting the finishing touches on Exploring Personal Genomics (available for pre-order on Amazon!), but I thought I’d share the wordle for the entire book:
This past week, I attended the 2012 Pacific Symposium on Biocomputing, a week of science in the sun. For me, the week started with the “Systems Pharmacogenomics: Bridging the Gap” workshop. The session opened with the systems theme from Trey Ideker, who discussed epistatic interactions in yeast and how to apply the results to GWAS and eQTLs, raising challenges in the complexity of the analysis. Fortunately, model organisms are a bit easier to work with, and generating all pairwise combinations of, say, gene knockouts in yeast is a difficult but not unreasonable task. Pankaj Agrawal brought a perspective from industry and how systems approaches are being used at GSK. He described the use of high-throughput resources, like the Connectivity Map and GWAS data, for pharmaceutical purposes such as drug repurposing: when a GWAS hit lies in a pharmacogene (for a phenotype other than the drug’s indication), it seems like a good time to consider the drug for other purposes.
While the first talks were highly optimistic, the reality became slightly more apparent in Stephane Bourgeois’s talk on the International Warfarin Pharmacogenetics Consortium (IWPC)’s meta-analysis of warfarin data. Warfarin is a poster child of genomic personalized medicine, as ~55% of the variance in stable warfarin dose can be explained by clinical and genetic factors. While this would be marked as a success by most people studying complex traits, it suffers from the same problems when trying to explain the remainder of the variance. Bourgeois presented a large meta-analysis that was only able to explain incrementally more variance, despite increases in power. Here, there is a need to be more clever and integrate systems approaches (like the epistasis work Trey Ideker talked about) to generate results of clinical significance (not to mention that the burden is greater: getting a drug label approved by the FDA is slightly more difficult than getting a paper into NEJM). Russ Altman talked about using very different sources of data, including molecular data, text mining, expression data, and adverse event reports, for pharmacogenomic inference. Afterwards, the discussion brought up some new challenges. In a single-celled organism, the consequence of a gene-gene interaction may be apparent, but when it comes to systems pharmacogenomics, the presence of multiple cell types in multiple organs complicates matters further. This will require new methods, such as rapid mutagenesis to simulate the effect of rare variants.
The evening got interesting with a debate on the ethics of informed consent and the return of genetic results to patients. The premise involved a resolution under which patients would be sequenced and their genetic results linked to their EMR and shared with researchers. Greg Hampikian and Eric Meslin debated the merits and downsides of this resolution, asking the audience to move to either side of the room depending on their stance at the time. The concept of patient “dignity” was discussed, as well as data privacy, openness, and who should be allowed access to the data. The notion of “genetic exceptionalism” was challenged and scientists’ motives were questioned. An “opt-out” system was also discussed, whose proponents did not share the fears of the opponents, while the opponents lamented the lack of proper informed consent. The debate was heated, to say the least.
The next day, Elaine Mardis’s keynote talk brought a number of applications of computational methods to clinical outcomes. Methods to detect variation in heterogeneous samples, as well as sequencing followed by re-sequencing, were used for characterizing relapse. The rest of the day involved text-mining approaches to pharmacogenomics, extracting drug-gene and drug-drug relationships from mining pharmacogenomic literature, as well as a followup to the Systems Pharmacogenomics workshop, which was quite practical, discussing issues of applying results from a model organism to humans (how to get from a pathway in one species to the homologous pathway in another) and using electronic medical records for biomedical validation.
At the discussion following the Personalized Medicine session (at which I presented the Interpretome platform), the privacy discussion resurfaced. We discussed the use of genetic information in courses such as Stanford’s Personalized Medicine course, as well as individuals’ reactions to obtaining genetic information and whether individuals and patients experience anxiety or change behaviors in response to the information.
All in all, the week was an interesting look into the interplay of complex mathematical modeling, biological discovery, and the practical and ethical issues involved (not to mention a great week in Hawaii with a great group of scientists). It is a conference I would highly recommend and am looking forward to going back soon.
I’ve been spending most of my time these days thinking about our personal genomes and their implications for our daily lives. I’ve had time to reflect on this during the past few events I’ve attended (the Cold Spring Harbor Personal Genomes conference, the Open Science Summit, and a BioCurious Advanced Personal Genomics workshop I helped teach) and while knee-deep in writing a chapter on ethics for “Exploring Personal Genomics” (see below).
The Personal Genomes conference at CSHL (which, incidentally, is joining the Pharmacogenomics conference next year to become “Personal Genomes & Medical Genomics”) featured exciting methods and state-of-the-art analyses for personal genomics, but also clinical implications, including stories of the use of genome sequencing in aiding clinical decision making. Many there argued that the time for clinical sequencing is now, but the issue was raised of who would interpret genomic data. There are many pathologists interpreting laboratory data for diagnosis, but would they want to (and could they be trained to) interpret genomes? Later, at the Open Science Summit, the current practical implications were discussed, along with the ethics and current limitations (the “hopes and hypes”), but the feeling was still optimistic. Then, on Wednesday, at the Biocurious hacker space, the “Advanced Personal Genomics” workshop that I helped teach provided a glimpse into genome interpretation for early adopters. The response at this event was surprisingly positive: even after discussing the limitations of personal genomics at length, these individuals were still curious and enthusiastic about learning more about their genomes.
From our end, our paper describing the Interpretome system is online now, to be presented at the Pacific Symposium on Biocomputing in January. The paper serves as a rigorous description of the platform, including examples of personal genomics analyses and the modular nature of the system. Furthermore, Joel Dudley and I have been working on a book entitled “Exploring Personal Genomics,” a guide to understanding and interpreting a personal genome, to be published in 2012 by Oxford University Press (anyone who would like to be notified when the book is released is invited to enter their email address here).
Yesterday, a paper on the analysis and interpretation of the genomes of a family of four was released in PLoS Genetics and featured in the Wall Street Journal, spearheaded by Rick Dewey and Euan Ashley. I was fortunate to be involved in this groundbreaking analysis, a logical next step to the clinical interpretation of Steve Quake’s genome last year in the Lancet. Collaborating on this paper got me thinking about analysis of family genomes in the age of GWAS (Genome-Wide Association Studies).
In the linkage studies of the past, researchers focused on families and the segregation patterns of alleles to identify genes significantly linked with disease. These studies worked well for rare diseases, as researchers could focus on a single linked region/gene at a time. But for complex/multigenic diseases, the segregation patterns are not as clear, and the GWAS community has stepped in to tackle these problems on a larger scale. However, the genetic basis of only a few diseases, such as age-related macular degeneration, has been successfully mapped by GWAS (with, say, greater than 50% of the genetic variance explained by the factors identified in the studies), and the bulk of diseases and traits have come up short. For complex diseases, the difficulty is the same as before: with so many unaccounted-for variables, we are back to a needle-in-a-haystack problem. There is great potential in combining family data with GWAS-based methods: analogous to Sarah Ng and Jay Shendure’s identification of disease genes in rare diseases by exome sequencing, the ability to “subtract out” some of the noise (which may be family-specific) may yield more reliable results. Specifically, an unaffected family member may be used to down-weight the SNPs shared with an affected subject.
Looking at the genomes of a whole family at once in a clinical assessment context (applying results from large studies to a smaller number of individuals) was crucial to this analysis. At the most basic level, simply estimating error rates is greatly aided by the sequencing of multiple family members: knowing that the likelihood that AG and GG parents will have an AA child is vanishingly small gives us a confidence level for the SNP calls we do make. Then, when it comes to assessment of disease risk, analysis of multiple family members demonstrates the exact problem of complex diseases. Even if neither parent is at elevated risk for a disease, the exact combination of alleles passed down can confer a greater risk than the average of the parents’ risks. It is precisely here that genetic risk has the potential to trump family history in clinical analysis. At present, family history is a great predictor of clinical outcome, as it encapsulates much of the uncharacterized risk conferred by genetics. However, as our understanding of the genetic factors of disease increases, the genetic profile can incorporate something the family history cannot: the precise pattern of allele segregation. Finally, a family analysis allows for phased genomes, which can reveal “compound heterozygotes,” cases where the two copies of a gene are affected by two different SNPs. While each SNP may not be damaging on its own, the combination may render both copies of the gene ineffective.
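The Mendelian-consistency idea is simple enough to sketch in a few lines of Python (a toy illustration of the principle, not the paper’s actual genotype-calling code): given unphased parental genotypes, enumerate the child genotypes compatible with simple inheritance, and flag anything else as a likely genotyping error (or a de novo event).

```python
# Toy Mendelian-consistency check on unphased genotypes, e.g. "AG".
from itertools import product

def possible_child_genotypes(parent1, parent2):
    """Set of unphased child genotypes consistent with the two parents."""
    # The child inherits one allele from each parent; sort to make "GA" == "AG".
    return {"".join(sorted(a + b)) for a, b in product(parent1, parent2)}

def is_mendelian_error(parent1, parent2, child):
    """True if the child's genotype cannot arise from these parents."""
    return "".join(sorted(child)) not in possible_child_genotypes(parent1, parent2)
```

For the example above, `possible_child_genotypes("AG", "GG")` contains only AG and GG, so an AA call in the child is flagged as an error.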
As the availability of genome-wide methods rapidly expanded, analysis of families seemed to go out of fashion for a while. Of course, sophisticated informatics methods will be needed to tease the signal from the noise, and these will not be trivial. However, given current trends in the cost of genotyping and genome sequencing, a dataset of 100 families with a common disease is not out of reach. Then, of course, the clinical assessment of a family’s genomes is another challenge, to which this paper brings a novel perspective, and it will be fascinating to follow the further development of these methods.
First, let me start off by saying thanks to everyone who has explored their data on Interpretome so far. We’ve had a tremendous response to the site and I couldn’t be more thrilled. I wanted to provide an update based on the perspectives I’ve gathered from scouring the web for reactions to the analyses on the site.
It seems that many have enjoyed the Ancestry and Neandertal analyses, and to be honest, these are some of my favorites too! They truly are a fascinating look into the role of ancient DNA and human migration patterns throughout history. Plotting yourself on a world/continent map can really give a perspective on where you’ve come from. My (not surprising) Polish ancestry jumps out in the POPRES dataset, clustering among Polish and Northern Europeans.
For those looking to explore their own ancestry, for most of the datasets, plotting PC1 vs. PC2 with any number of SNPs (the more, the better) should give good results, assuming you are somewhat similar to at least one of the populations in that reference panel. This means that Africans will likely find interesting results from the African PCA, but it is uninterpretable for Europeans. (As an aside, the POPRES dataset is best run with PC1 vs. PC4, using the relevant platform: 43K for v2 and 74K for v3.)
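For the technically inclined, the idea behind these plots can be sketched in a few lines of NumPy (a toy illustration with random stand-in data, not the Interpretome code): fit principal components on a reference panel’s genotype matrix (individuals × SNPs, coded 0/1/2 by allele count), then project your own genotypes onto the same axes.

```python
# Toy PCA ancestry projection. The "reference panel" here is random data,
# standing in for something like HapMap or POPRES genotypes.
import numpy as np

rng = np.random.default_rng(0)
ref = rng.integers(0, 3, size=(200, 500)).astype(float)  # 200 people x 500 SNPs

# Center on the reference panel, then take principal axes via SVD.
mean = ref.mean(axis=0)
_, _, vt = np.linalg.svd(ref - mean, full_matrices=False)
pcs = vt[:4]  # top four principal axes (rows)

# Project one individual's genotypes onto the panel's axes.
me = rng.integers(0, 3, size=(500,)).astype(float)
coords = (me - mean) @ pcs.T  # position on PC1..PC4
pc1, pc2 = coords[0], coords[1]  # these are the two values you would plot
```

Where you land relative to the reference individuals (who can be projected the same way) is what the map view is showing.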
For the Chromosome Painting, at present, there is no “right” set of parameters. We use a heuristic/approximation algorithm to determine the ancestry tracts, and we are actively developing more robust methods (as well as adding more distinct, i.e., less admixed, reference populations). The challenge is to provide an accurate tool that can be run in your own browser (without too much computing power or sophisticated custom software). At the moment, the HapMap 2 painting should work reasonably well: tuning the parameters will affect the sensitivity, at the cost of some noise.
Also coming soon is a new way to illustrate disease risk analysis, grouping SNPs by disease to make them easier to visualize. While we don’t intend to provide any actual predictive analysis, our mission is to provide the tools needed for anyone to explore their genome. We hope that this will educate the public (clinicians and scientists as well as hobbyists) about the power, potential power, and limitations of a personal genome, and enable individuals to do and share their own custom analyses.
In the race for the $1,000 genome, the issue of the $1,000,000 interpretation has not been forgotten. Combing through millions of variants in a personal genome presents numerous challenges for all parties involved: the physician looking to add genomic measurements to inform a diagnosis, the patient trying to figure out what to be worried about, and the hobbyist interested in what their DNA means to them. Direct-to-consumer genetic testing companies such as 23andme, Lumigenix, and Navigenics offer a glimpse into the interpretation of a genome. These companies curate literature on gene-trait associations and provide attractive user interfaces for navigating a personal genotype. However, the interpretations offered by these companies often differ, not because of inherent differences in technologies, but because of which variants they choose to consider in their calculations. While many have taken this fact to indicate a weakness of the genetic testing industry (and indeed, it is one that needs to be addressed), it is also a reflection of the dynamic nature of the field.
Interpretome provides an open-source framework for personal genome interpretation, demonstrating the power of genotyping for ancestral and clinical analysis (though it should be noted that this service, like 23andme, should not be used for diagnostic purposes and is not approved by the FDA). We also feature an exploratory section, mirroring the lectures of the course, as well as other analyses of interest and an option to upload your own analyses. Here you’ll find the fun Neandertal calculator, which counts the number of your alleles likely derived from Neandertals (according to Green et al.).
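The calculator’s core logic is just allele counting, which can be sketched as follows (the rsIDs and alleles below are invented placeholders for illustration, not the actual Green et al. marker set):

```python
# Toy Neandertal-allele counter. The marker table is hypothetical.
NEANDERTAL_ALLELES = {"rs0000001": "T", "rs0000002": "A"}  # rsID -> derived allele

def neandertal_allele_count(genotypes):
    """Count alleles matching the putative Neandertal-derived allele.

    genotypes: dict mapping rsID -> two-letter unphased genotype, e.g. "TT".
    Each person carries 0, 1, or 2 matching alleles per marker.
    """
    return sum(
        genotypes.get(rsid, "").count(allele)
        for rsid, allele in NEANDERTAL_ALLELES.items()
    )
```

With a real marker panel, the count over hundreds of informative SNPs is what gets compared against the distribution for your population.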
As new results like this one pop up, this open framework will allow users to customize their analyses based on their interest. Explore your genome at www.interpretome.com.
As some of you may know, for the better part of this year, I have been involved in the organization of BCATS 2010 (Biomedical Computation at Stanford), a one-day student-run conference on Stanford’s campus. Yesterday, I had the great privilege of being the chair of this interdisciplinary conference, which featured 14 talks, 10 spotlight/bullet talks, and close to 50 posters (abstract book). However, this conference was not your typical themed meeting. Rather than a conference where everyone in attendance was an expert in the specific narrow field of the conference, BCATS attendees all spoke the same language of computational and statistical analysis, but applied these methods to very different problems across biology and medicine.
The day started with Zemin Zhang from Genentech, whom we were very grateful to have deliver the first keynote address, on his work in genomics and computational biology. Dr. Zhang brought a perspective on the power of genome sequencing in understanding the complex biological basis of cancer. From there, the first group of student talks focused on computational approaches to systems biology, from the analysis of transcriptomics data to an integrative model of a whole cell. The second session focused on mining existing datasets to learn about drugs, their effects, and their interactions with other drugs. This was a particularly interesting session, as the speakers presented work that applied mathematical and statistical methods to a topic that everyone could understand without too much technical knowledge: the clinical use of drugs, including things that can go right and things that can go wrong.
In the afternoon, BJ Fregly from the University of Florida provided a fascinating look into personalized therapy for osteoarthritis through simulation of arthritis development, rehabilitation treatment, and the forces on knee joints and muscles. In the keynote and the session that followed, the talks stressed the importance of biological models for developing our understanding of various biomechanical and biochemical processes. The final session’s talks brought together multiple sources of data for the representation and interpretation of clinical data, reminding us that direct application to a clinical setting is never too far off.
All in all, the conference brought together researchers from across the fields of biocomputation in a unified setting. While the experience taught me more than I’d like to know about the logistics of organizing such an event (along with everything that can go wrong at the last minute), as soon as I sat down and started listening to the first talks, I was reminded of the quality of the science that really goes on here at Stanford. On a more personal note, I would like to thank my co-organizers, Rob Tirrell, Jessica Faruque, Amir Ghazvinian, Matt Demers, and Keyan Salari for all their help throughout, as well as our sponsors and volunteers for their support. It was a great day and I look forward to next year’s conference.
This morning, Genomes Unzipped launched phase 2 of their website: a dive into the analysis of personal genomics. Today, this began with the release of their personal (23andme and Counsyl) genetic data, as well as a snazzy-looking genome browser targeted at personal genetic data. While playing with the site, the same lesson dawned on me that I’ve noticed a number of times before (especially during the Personalized Medicine course): personal genomes are so much more interesting when they are personal. Tools like the genome browser (and their forthcoming analysis code) are instantly more useful, entertaining, and (most importantly) educational/illuminating when exploring one’s own genotype data.
On top of this, ensuring open access to data, along with openness of genome research projects, is essential to progress. While consent issues are, of course, extremely important, the addition of any phenotype information is crucial to the success of genetic discovery programs: one can only imagine how this would have made the already-powerful 1000 Genomes Project even more powerful. Genome-wide, trait-wide association studies, based on open communal analyses, have the potential to transform the landscape of genetics and heritability. So hats off to Daniel MacArthur, Luke Jostins, and the whole Genomes Unzipped team for getting this project moving. I look forward to seeing what comes out of the data and the experience in general.
A few weeks ago, Mike Snyder gave the last lecture of Personalized Medicine and Genomics (Genetics 210) here at Stanford University, a course for which I’ve had the privilege of being a teaching assistant. In this pilot program, Stanford medical and graduate students were taught about the state-of-the-art in personal genetics and given the option to get themselves genotyped. While we are still analyzing course survey data and it has not yet been established if the course will be offered again (in its current form), one thing is clear: everyone involved in the course, from the students to the teachers, from the proponents to the critics, learned something about genetic testing. The San Francisco Chronicle did a great job covering the class before and after, but I thought I’d cover a bit more of the details.
33 students (the TA included) learned something about their personal genetic risks. Whether it was a genetically estimated increase in prostate cancer risk from the population average of 17% to a personal 24%, or an “increased-risk” designation for hypertension, these students can now understand and connect with the contribution of their genetics to their personal health. While the concept of an odds ratio could have been presented starting with statistics and ending with an anonymous number that confers a disease risk on an anonymous group of people, instructors Keyan Salari and Euan Ashley presented this concept to students through their own genetic reality, making the results of the students’ analyses real and tangible. When someone sees that they have a TT genotype at a locus and that the disease is more common in people with a TT genotype at that locus (than in those with an AT or AA genotype), it makes them think critically about what this means in general, for the population, and for them. Importantly, the students now understand that a genetic test is not a diagnosis. It is a scientifically informed estimate of disease risk, based on the application of published scientific studies (which students were taught to scrutinize and analyze critically, with all the reasons a study like this may fail or at least be incomplete) to a personal genome. While there exist a few conditions for which genetics plays a majority role, most results provided by a DTC genetic testing company (such as 23andme) confer moderate risks. A typical result may involve an increase from a 1/6 chance of getting prostate cancer at some point in life to a 1/4 chance. At first glance, the suggestion that this person is at “high risk” for a disease may sound scary to the uninformed.
However, if a person were told that the risk of prostate cancer for individuals of his race and ethnic background were 24% (while only 16% overall), this would likely not cause undue stress; instead, he would be more informed and might consider earlier screening options.
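The arithmetic behind a “17% to 24%” style result is worth spelling out. A common approach (sketched here with a made-up odds ratio for illustration, not the course’s exact calculation) is to convert the baseline risk to odds, multiply by the genotype’s odds ratio, and convert back to a probability:

```python
# Toy genotype risk adjustment via odds ratios; the OR value is illustrative.
def adjusted_risk(population_risk, odds_ratio):
    """Apply a per-genotype odds ratio to a baseline population risk."""
    odds = population_risk / (1.0 - population_risk)  # probability -> odds
    new_odds = odds * odds_ratio                      # scale by the genotype's OR
    return new_odds / (1.0 + new_odds)                # odds -> probability

# A baseline risk of 17% with a hypothetical odds ratio of ~1.5 lands in the
# low-to-mid 20s percent, the kind of shift a DTC report would describe.
```

An odds ratio of 1.0 leaves the risk unchanged, which is a useful sanity check on the formula.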
The same goes for the students’ responses to various drugs. In a lecture on pharmacogenomics, Russ Altman brought his expertise in gene-drug interactions and made it personal. It’s easy to say “different people respond to drugs differently based on genetics,” but it is not until one sees one’s own genetics suggesting an increased sensitivity to warfarin that one can pause and say “If I didn’t know my genetic factors for warfarin dosing, I might be prescribed too much (which can cause side effects such as hemorrhaging).” Even if this person is not currently taking warfarin, it is easy to discern which of these two statements is more effective for learning about pharmacogenomics.
These students also learned something about what their DNA can tell them about their ancestry. In Carlos Bustamante’s lecture, running PCA and admixture methods, they observed where they fell on the “genetic map.” Most of the time, this was not news to the students: after all, ancestry is not a particularly anonymous trait. However, seeing the power of these methods to detect differences between populations, and even to separate an individual’s genome into African-, Asian-, and European-derived segments, demonstrated the extent of diversity among individuals who are otherwise 99.93% similar. Of course, some students observed results that were not as straightforward and could only be explained by dissecting the methods employed, a personal foray into scientific analysis.
Along these lines, one particularly important lesson was a scientific look at studies involving genetic information. For instance, students observed the shortcomings of genetic information, such as the inability to significantly predict a fairly heritable trait, height (although it should be noted that the instructors learned a valuable lesson about science education here: not everything works out as planned. For this particular sample, genetics was better able to predict height than the prediction based on parents’ heights, the opposite of the typically reported result). In addition, with Stuart Kim’s perspective on the genomics of aging, the students explored the scientific methods behind these studies through a closer look at the centenarian prediction paper, which was published while the class was in session and subsequently questioned by the scientific community.
The course ended with Mike Snyder’s vision of the future of personal genomics and what will happen when the cost of a full genome sequence falls to consumer-affordable levels. At that point, an individual may have knowledge of his or her rare variants, or even the rarest “private” mutations, for which annotation and information are not as readily available as for well-studied SNPs. The students were encouraged to think about this largely unexplored area, which will no doubt become integrated with broader aspects of biology and science.
Before the students decided whether or not to get genotyped, they were presented with perspectives (from genetic counselors and ethicists Kelly Ormond, Louanne Hudgins, Hank Greely, and Mike Grecius) and asked to think critically about the implications of knowing their personal genotype. This was no doubt an important aspect of the course, ensuring informed consent for the medical and graduate students electing to receive information with which they may not have been familiar. Once they made their decision, however, the informing did not end there. The lectures and exercises (and options for genetic counseling) encouraged students to constantly explore their genetics, to truly understand the basis of the information and what it means to them. With students invested in the analysis of their personal genetic data, we hoped to effectively teach the personal nature of genetics. While we have not yet fully analyzed the final effects of the course or the effectiveness of the genotyping option, at the very least, we believe we have dispelled some of the fears and controversy surrounding genetic testing for these students. And hopefully, the 60 students who took time out of their summer for an optional elective course learned something about personal genotypes.