Science AMA Series: We’re NIH and UCSF scientists cataloging all the genes and regulatory elements in the human genome. The latest stage of the project aims to discover the grammar and punctuation of DNA hidden in the genome’s “dark matter.” AUA!



I've just learned about repetitive DNA sequences (LTRs, LINEs, SINEs, etc.) in my biology class. Do you think they serve any important function, or are they just parasitic "garbage DNA"? What would happen if they were all removed?


Great question, and one that my lab is very interested in and actively researching! With time and a lot of cool research, repeats are being found to have important functions in our genome. Many of them have been what's called "exapted." This is a term used in evolutionary biology to describe a trait that has been co-opted for a use other than the one for which natural selection originally built it. There are several cases where repeats have been found to turn into additional exons of existing genes, or into gene regulatory elements that regulate other genes and change genome structure. Of note also, in the new phase of ENCODE, what we affectionately call ENCODE phase 4, there is a computational group, led by Ting Wang from Washington University in St. Louis, that will specifically study the role of repeats in gene regulation. - Nadav

Is there anything happening in the field of genetics that scares the crap out of you?


Sometimes people outside of biology seem unaware that both genes and environment matter, and that they interact with each other. Learning about genes is very important and useful, yet it is not the whole story. People think that if they know all about their genes they will know everything about their biology, without recognizing that the environment plays a huge part. You can't necessarily plan your whole life around a genetic test. - Mike

My biggest fears in life are eye drops and chainsaws. As for genetics, the obvious current scare is using CRISPR/Cas9 genome editing to make customized babies with traits parents want. In terms of realistic fears, I rate this as a 7 on a scale of 10. -Nadav

Thanks for the AMA, and good luck with ENCODE. I'm interested in various fields of -omics, cancer, embryonic growth factors, and enjoyed the work of Biava.

I'm fascinated by the impact CRISPR is having, and am curious what kind of tool/technology is still missing that would improve/accelerate your work?


Good question. The CRISPR tool has really revolutionized the way we study DNA function. With CRISPR we are now pretty good at dissecting DNA function on a large scale using cell lines like embryonic stem cells or immortalized cell lines. However, it is still hard to do gene editing in a physiologically relevant way directly in primary cells (those taken directly from living animals) or in vivo. The major bottleneck is getting access to those cell types and delivering the genome editing tool into those cells efficiently. -Yin

As a high school biology teacher, what I've been telling my students for several years is that only about 1.5% of the human genome encodes proteins, and the rest is:

  • regulatory elements
  • genes for structural and regulatory RNAs
  • junk like pseudogenes and endogenous retroviruses
  • duplications of various kinds
  • stuff that may have a function but we have no idea what it is

As a high school level summary, was this a reasonably accurate picture of our knowledge of the genome ~10 years ago when I started teaching? What do you think the biggest revisions have been?


(edit: formatting)


This is a pretty good summary. The lessons we learned in the past ten years include: 1. There are millions of non-coding regulatory elements, a much bigger number than the protein coding sequences. 2. The regulatory elements are cell type specific and they are the major driving force for cellular identity. 3. A majority of the genetic variations associated with complex diseases are located in these regulatory elements, so mutations in these regions can play important roles in an individual's susceptibility to diseases. -Yin

Previously we have held the belief that a lot of this 'dark matter' DNA was useless ("junk DNA") and it's only been more recently in the last five to ten years that we have realised a lot of what we previously thought was junk actually has function. Based on what you are doing how much of our DNA would you reckon is actually junk and how much of our DNA actually has a function? Further to this why do we have junk DNA to begin with, why doesn't our body get rid of DNA we have no use for?


Great question! Only 2% of our genome consists of genes that code for protein. Around 45% of our genome is actually made of what are called repeats, many of them viruses that were inserted into our genome. Various cool studies show that several of them have adopted new functions that made them 'stay' in our genome, such as becoming parts of other genes or taking on a gene regulatory function (instructing genes when, where and at what levels to turn on). As for the remaining 53%, we see that a lot of it has regulatory function and other functions which we still don't know and which are fascinating in my mind to uncover.

The history of this field is also really fascinating – I recommend this article that does a great job describing when researchers first recognized the role of non-coding regulatory regions in the DNA (earlier than you might think!) -Nadav

Hello, thank you for doing an AMA series. What would be a good book to understand more about our genome? I have some intro biology and genetics books but they seem kind of outdated.


The Deeper Genome, John Parrington; Homology, Genes, and Evolutionary Innovation, Gunter P. Wagner -Mike

For a more lay-oriented audience, I would recommend "The Gene: An Intimate History" by Siddhartha Mukherjee -Elise

UCSF Grad here. Two questions:

1) What one thing took you too long to learn in your scientific career, i.e., what piece of wisdom would you like to impart to young Ph.D. candidates?

2) Are you using fruit fly or mouse work to get a head start or do comparisons, or are you looking at human elements only?



To answer the 1st question: Stay focused and be persistent. There will be many distractions along your scientific career, at both personal and scientific levels. It is important to identify the scientific question you would like to answer and keep working on it. -Yin

I'm a UCSF grad, too! To your second question specifically, modENCODE studied the worm and fly epigenomes and transcriptomes. ENCODE studies human and mouse. - Mike

  • So, do you now agree that a confirmed chemical activity at a site is not equivalent to the site being functional? And that, no, not 80% of our DNA is functional?
  • How did your point of view evolve on that matter since the 2012 controversy? How dissimilar were your own points of view as compared to the official press releases, saying that everything is functional?

(For those interested in this controversy: "we need as biologists to defend traditional understandings of function: the publicity surrounding ENCODE reveals the extent to which these understandings have been eroded.")


ENCODE 2 found biochemical signatures at 80% of the genome, adding up all signatures for all cell types. This was an important first pass. However, if one looks at particular biochemical marks (such as DNase) that are markers for particular candidate functions (regulatory DNA), the numbers are quite different (in this case about 10%). An important part of ENCODE 4 will be its specific focus on examining candidate elements to determine whether, when, and where they function in important human cell types. This will be the task of the new ENCODE characterization centers, two of which Yin and Nadav will be directing at UCSF. - Mike
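The difference between "adding up all signatures for all cell types" and looking at one particular mark can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the 20-bin "genome", the assay names, and the bin assignments are invented and are not ENCODE data or ENCODE's method. The point is only that the union over many assays and cell types can be much larger than the coverage of any single mark.

```python
GENOME_BINS = 20  # toy genome split into 20 equal bins (hypothetical)

# Hypothetical bins carrying each signal in each cell type.
signals = {
    ("cell_type_1", "DNase"): {0, 1},
    ("cell_type_2", "DNase"): {1, 2},
    ("cell_type_1", "transcription"): {3, 4, 5, 6, 7, 8, 9},
    ("cell_type_2", "transcription"): {8, 9, 10, 11, 12, 13},
    ("cell_type_1", "histone_mark"): {0, 14, 15},
}

def fraction_covered(bin_sets):
    """Fraction of the toy genome covered by the union of the given bin sets."""
    covered = set().union(*bin_sets)
    return len(covered) / GENOME_BINS

# Union over every assay in every cell type.
all_signals = fraction_covered(signals.values())

# Union over one kind of mark only (here the toy "DNase" entries).
dnase_only = fraction_covered(
    bins for (_, assay), bins in signals.items() if assay == "DNase"
)
```

In this toy setup the union of everything covers far more bins than the single mark does, which mirrors the qualitative contrast Mike describes without reproducing the real numbers.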

I know that a major complicating factor in eukaryotic genetics is the extreme distance between some regulatory elements and their genes. What is your approach to identifying and characterizing the regulatory role of these distal regulatory elements? Is it based on sequence gazing?


A few approaches are being used by the field, including ENCODE. One is to ask what distant pieces of DNA interact physically, as sometimes this correlates with gene regulation. Another approach is to ask which candidate regulatory elements appear to be consistently active in the same cell types as neighboring genes. A third approach is to consider which candidate regulatory elements are in the same structural domain ("topologically associated domain" or TAD) as neighboring genes. - Mike
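The second approach above (asking which candidate elements are consistently active in the same cell types as neighboring genes) can be sketched as a simple correlation across cell types. This is a minimal illustration, not ENCODE's pipeline: the activity scores, the gene names `geneA` and `geneB`, and the helper `pearson` are all hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical activity of one candidate element across five cell types.
enhancer_activity = [0.9, 0.1, 0.8, 0.0, 0.7]

# Hypothetical expression of two neighboring genes in the same five cell types.
gene_expression = {
    "geneA": [10.0, 1.0, 9.0, 0.5, 8.0],  # high where the element is active
    "geneB": [2.0, 9.0, 1.0, 8.0, 2.5],   # high where the element is inactive
}

# Score each neighboring gene by how well it tracks the element's activity.
scores = {
    gene: pearson(enhancer_activity, expr)
    for gene, expr in gene_expression.items()
}
best_gene = max(scores, key=scores.get)
```

A high correlation only nominates a candidate target gene; as the answer notes, these signals correlate with regulation and still need functional testing.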

For ENCODE 4, the Shen lab is taking gene-centric approaches to functionally characterize potential regulatory elements. For example, we are using CRISPR to screen regulatory sequences with candidate genes labeled with a fluorescent reporter such as GFP. By doing that, not only will we be able to tell whether the element is functional, but also whether it regulates specific genes of interest. We hope that by doing this on a large scale, we will be able to learn general rules to predict how non-coding DNA sequences function and how to predict their target genes. -Yin

ENCODE has, to the best of my knowledge, been focused principally on immortalized laboratory lines and carcinoma lines. The concern is, of course, that these are not well representative of the regulation present in 'healthy' cells, and that many of the elements identified are the result of dysregulation.

Has a shift to systematic mapping of primary cultured cells or tissue samples started to occur? Is there any active project to investigate population variability of regulation across distinct human lineages?


Actually, ENCODE 3 has already released data from mouse and human primary cells, tissues and organs. For mouse, data was generated on a developmental time course from more than 50 biosamples that are explants from tissues or organs. ENCODE 3 also generated human data from over 150 biosamples that are tissues/organs, primary cells, pluripotent cells or their derivatives. Moving forward in ENCODE 4, there will be even more work on primary tissues and organs, both from healthy individuals and from those with various diseases in order to maximize discovery of candidate functional elements. Importantly, the human biosamples will be obtained from individuals who have given explicit consent for genomic research and for the release of resulting data in unrestricted access databases. That means that anyone who has internet access can use these data without delay! -Elise

Evo-Devo grad student here, it's great to see such an awesome genomics group for AMA!

Nowadays, genomic "dark matter" seems to be a heavy word implying a whole bunch of different things. Does your analysis include anything regarding transcriptomics or are you purely looking at "junk DNA"?


Our group is mainly looking at gene regulatory elements such as promoters and enhancers that regulate transcription. Several of them are actually transcribed and are being referred to as enhancer RNA (eRNAs). - Nadav

How do you plan to simulate different cell conditions in the lab? You mentioned using stem cells to explore different cell types. However, during a cell's life from embryo to mature adult to cell death, the activation of these dark regions might depend on the interaction of neighbor cells that may not be in an isolated culture. Will you break your analysis into stages like first looking at cells in isolation, and then in pairs, and so on?


This is a really good question. We should always be careful about the system we are using for testing DNA function, given that most of these elements function in a cell type specific manner. It is important for us to use a specific cell type to study sequence function so that we know the specificities. For that reason, most of our studies use cells in isolation. That said, it is possible to apply our approach to a more sophisticated system, e.g. a tissue or 3D mini-organoid culture, if we can successfully separate and analyze the different cell types. -Yin

This is impressive and amazing work here. My question is what classifies a sequence as "biologically relevant", and is a relevant sequence always relevant?


Non-coding regulatory regions are often functional only in specific biological contexts, e.g., in specific cell types, during certain times in development or after particular environmental exposures. So a big challenge is assaying for function in the appropriate biological setting. If you don't find something has functional activity, it could be that you aren't looking for it in the right biological context or it's possible that those sequences have one function under one set of conditions and another function under a different set. It's also possible that we don't have the right set of tools to probe for the particular function. Or perhaps, it just isn't functional? -Elise

Do you think epigenetics are the future of medicine?


I can see epigenetics playing an important role in medicine in the future alongside more traditional tools. For people suffering from chronic, episodic diseases, epigenomic biomarkers might help us to understand whether they are responding to a therapeutic regimen, and to learn where they stand in the symptomatic cycle. For instance if someone is starting to experience an episode of depression, we might be able to say "hang on, this should pass soon" or "this could be a major episode, a hospital stay might be in order." This kind of insight could really help patients. - Mike

Given that the nucleus is a 3D 'sphere' crammed with DNA, have your mapping efforts or regulatory theories taken into account or tried to build a model that includes physical space? Do you have enough information to think about genomic secondary or tertiary structure?


ENCODE 3 mapping centers generated some 3D genomics data, and this effort will be greatly expanded in ENCODE 4, with two mapping centers producing multiple types of 3D data in a large variety of cell lines. Computational groups both in and out of ENCODE will be using this data to better understand how 3D genome organization impacts gene regulation, and ultimately human health and disease. Many individual research groups are also tackling related questions, as is the NIH 4DNucleome project. -Dan

I'd like to ask you to expand on part of what was said in the post: "These crucial regulatory elements — such as promoters and enhancers — coordinate the activity of thousands of genes. Differences in these regulators help explain why skin cells and brain cells are so different, despite containing exactly the same genetic sequence."

As a layman, I'm looking for an analogy to help me understand this, and your mention of grammar seems to be a place to start. Could we take regulatory elements to be like emphasis in a sentence? For example, this sentence can have 7 different meanings, despite the same content, depending on where the emphasis is placed:

*SHE said she did not take his money. (It was not someone else who said it.)

*She SAID she did not take his money. (So I believe her.)

*She said SHE did not take his money. (But someone else did.)

*She said she did NOT take his money. (And thus she is still poor.)

*She said she did not TAKE his money. (But she won it gambling.)

*She said she did not take HIS money. (But she took someone else's.)

*She said she did not take his MONEY. (But she did take something else of his.)

(Sorry if listing each is pedantic, it's just my OCD.)

1.) Is this analogy appropriate, and if yes, then 2.) Can you explain how that emphasis happens within DNA sequences, as in skin cells and brain cells?

Thank you!


There are two parts to the regulatory code: the specific sequences in DNA that can be controlled by regulatory proteins (analogous to deciding which "words" to say), and how those sequences are combined (analogous to "grammar" or "syntax"). While the words are not fully understood, they are better understood than the grammar. It appears that some regulatory elements are like billboards, where all that seems to matter are the words ("coffee stop now"), and other regulatory elements require a precise order and spacing for the words ("Dessert stop now" is different from "Stop dessert now"). Regulatory elements that are optimized for a particular function (e.g., driving expression of the neighboring gene only during bone formation) paradoxically use individual sequences (or words) that are sub-optimal for protein binding. -Mike

PhD genetics student at UF. I work with axolotls, whose genome isn't sequenced. There's another lab currently in the process of performing this task and we do have some transcriptome data to go off of.

I'm trying to create targeted mutations and such and so we have a cDNA library in the meantime for things like that. How would you go about designing a probe for a gene where you don't have ANY transcriptome data? Like for example, the transcriptome shows no p16, but the probes were based off of human DNA. p16 is also pretty damn important, especially considering that axolotls don't get cancer, so they must have it.

In addition to this, their genome is 10 times bigger than the human genome. We want to test if there are multiple copies of some of the tumor suppressors. So far we think our best bet for detecting that is through Southern blot... are there any other, more precise ways to test this kind of thing? (I was recommended FISH but I'd need the genome sequence in order to make a probe for it... right?)

Thanks for any help or insight into this!!!


Tough problem. Agree with you. However, with sequencing costs becoming so low, is there a tissue/cell line that you think p16 might be expressed in that you can do RNA-seq on? With this technology becoming very cheap now (some companies quoting $250 per experiment) this might be a good starting option. Another option is doing some degenerate PCR on cDNA from regions of p16 that you think would be conserved in evolution that you can get from other organisms where sequence is available. -Nadav
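The degenerate PCR suggestion can be sketched in code: take a conserved region aligned across species where sequence is available, and collapse each alignment column into a single IUPAC ambiguity code to get a degenerate primer. This is a minimal sketch under stated assumptions. The three aligned sequences are invented placeholders (not real p16 sequence), the function name `degenerate_consensus` is hypothetical, and a real design would also weigh primer length, melting temperature, and keeping degeneracy low.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases they cover.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}

def degenerate_consensus(aligned_seqs):
    """Collapse each alignment column into one IUPAC code.

    Returns (primer, degeneracy), where degeneracy is the number of
    distinct sequences the degenerate primer represents.
    Assumes equal-length, gap-free, uppercase DNA sequences.
    """
    primer = []
    degeneracy = 1
    for column in zip(*aligned_seqs):
        bases = frozenset(column)
        primer.append(IUPAC[bases])
        degeneracy *= len(bases)
    return "".join(primer), degeneracy

# Made-up aligned sequences standing in for a conserved region from three species.
aligned = [
    "ATGGAGCCT",
    "ATGGAACCC",
    "ATGGAGCCA",
]
primer, degeneracy = degenerate_consensus(aligned)
```

Here the two variable columns become `R` (A or G) and `H` (A, C or T), so the primer pool contains 2 × 3 = 6 sequences.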

To the UCSF group: How can you afford to live in the bay area and still eat?


GREAT question! Scientific seminars tend to come with food. Grad students have actually made websites where these can be tracked on a daily basis. -Nadav

Have you ever come across some sequencing data that just didn't make any sense? Most likely a contaminant or some other boring explanation, but is there something that just sticks in the back of your head after all these years as something that could be biologically cool ?


When I was working in the lab, I encountered this kind of sequencing data every day! But seriously, two things that make genomics so powerful (and fun): First, with one experiment an entire genome's worth of data are collected. There are all kinds of things in the data, just waiting to be found. Second, when researchers make this digital data publicly available, either through projects like ENCODE or resources like GEO, any scientist can access it and use it to address their own research questions. Genomic data are tops for hypothesis generating! -Dan

What do you like most about what you do?


The honor of working with very smart people that are deeply engaged and passionate about what they do. -Mike

Can you please describe an example of regulatory elements that exist within non-coding regions of DNA in the context of a specific pathology? Aside from the usual satellite type sequences that move in and out of coding regions and mess things up.

How will the knowledge you generate be used to inform scientists in their work? Do you plan to create a resource or "map" for understanding regulatory elements in the genome? What will that look like?

How do you expect epigenetic regulation will affect non-coding regions of DNA? Can you talk about RNA induced transcriptional silencing? If a cell is transcribing so much DNA just to silence it, what is the purpose?

Are you going to look at structural DNA elements and how non-coding sequences might alter the fundamental structure of chromatin or is that more of a validation study?


Probably the most textbook example of this is the limb enhancer for the gene Sonic Hedgehog (SHH). Mutations in this non-coding regulatory element, which functions as an enhancer, have been shown to lead to limb malformations in humans, mice, dogs, cats and chickens. There are many other examples for other diseases, such as pancreatic agenesis, hearing loss, cancer, neurological diseases and many others.

The ENCODE resource has been used to help find where the function is, what cell type is affected, what the target gene is, and what the upstream regulators are. For example, when people do genome-wide association studies (GWAS) for disease, over 90% of the associations are with noncoding sites in the genome. The variant that is associated is not necessarily the causative one biologically; it is just the variant that was used on the GWAS chip to identify the association. Having a map can help, and has helped, in finding truly causative variants.

As for RNA induced silencing, GREAT question! It is definitely interesting why cells would invest energy to counter other energy. One potential cause in my mind is transposons. I think a lot of these systems probably originated to defend against transposon transcription and were later adopted for other functions. I highly recommend reading Hiten Madhani's review in Cell.

Finally, in ENCODE4 there will be 5 characterization centers that will look at the function of these sequences as well as mapping centers that will identify candidate functional elements. -Nadav

Edit: Updated description of mapping centers to broaden scope (last sentence).

Is there any cross referencing between noted genetic changes and occurrences in history/society?

Meaning does a famine or trend genetically alter a human's offspring ? Is there a possibility that a splicing of a different species caused a fissure in our genomes? (Sorry if my terminology is off)


There are reports of selection for variants that protect against some diseases, such as malaria. Interestingly, the variant that confers partial protection against malaria also confers risk of sickle cell disease, so there can be tradeoffs. Others have found variants that increase lactose tolerance, survival at high altitudes, etc.

A fascinating story that is not ENCODE work was reviewed here: A signature of recent evolution in humans was reported. The finding was selection against educational attainment (perhaps you've heard of the fictional movie Idiocracy?). However, there is a much stronger environmental effect at work in the opposite direction, so the net effect is an increase in educational attainment each generation. -Mike

Can you comment on how the regulatory elements in the genome are disrupted in the case of cancer? What about the epigenetic modifications of the regulatory elements in cancer?

Do we know what the transposable elements do?

Why do humans have so many repetitive elements?

I am very intrigued by this research and will be following this closely.


For regulatory elements and cancer, there are a few different known mechanisms that come from studies outside of ENCODE. One is when mutations alter a regulatory factor gene, so that it is now targeted to different genes or does something different to those genes. Another is when mutations alter regulatory sites in DNA, so that a neighboring gene is now expressed at a higher or lower level. Perhaps the newest idea is that sometimes gene boundaries themselves are altered, so genes begin to respond to their neighbor's regulatory elements. -Mike

Hi everyone. Congrats on your new funding and new centers for research. This is an exciting new development in DNA research. Thank you for taking us further into the unknown with ENCODE. A couple of questions for you guys:

  1. Do you or have you considered leveraging "Big Data" to store all your regulators, their behaviors, properties, locations etc in databases to be able to later glean more overall patterns and insights as you gather more and more data? What do some of those data points look like? What types of insights do you hope to gain from this type of mining of your big data?

  2. How exactly do you make a cell "glow" to indicate it has been "turned on" by some property of a gene regulator? Does the cell actually light up?

  3. Do you think any of our "Dark DNA" or regulators have any effect on our telomeres? Could you potentially make a change to affect the rate at which we age?

Thank you in advance for your time.



By integrating experimental and computational approaches, we hope the big data generated by ENCODE can help us learn general rules of how non-coding sequences work. We make a cell "glow" by tagging a gene with a fluorescent protein, e.g. GFP, so that when the gene of interest is expressed, the tagged GFP will also be expressed and the cells light up (but fluorescence is different from, e.g., bioluminescence). Mutations found in the promoter of TERT can affect telomere length in cancer. A few studies have also utilized ENCODE data to study aging. -Yin

How do you see the application of your work?

For example how do you think it could help pharmaceutical companies create better gene splicing therapies?


A better understanding of the noncoding part of the genome can increase our ability to interpret the effects of mutations in these regions, which can be a common cause of human disease. For example, if you look at all the genome-wide association studies (GWAS) that attempt to associate DNA variants with human disease, over 90% of them point to DNA variants in the noncoding portion of the genome.

As for pharmaceuticals, I see many potential applications: 1) Developing better sequences to direct the transgenes that are used for gene therapy to specific cell types. There are hundreds of clinical trials now with adeno-associated virus, most of them using a general promoter that causes the transgene to be expressed in all tissues, which could potentially result in harmful side effects. 2) For your splicing question, we see a big difference in isoforms between tissues. Knowing in what tissues these differences exist and how they arise (what regulates them) could be extremely important for developing these drugs. 3) Differences in drug response between individuals, some of which can lead to serious side effects. My lab has done some work on treating primary cells with drugs and then checking global changes in expression and in gene regulation. We see big changes due to drug response, and some of these sequences have DNA variants that can influence how people respond to drugs. -Nadav

I was super confused by the end of your prompt, "AUA" instead of the usual "AMA!" xD. Soooo my question is, what's the most interesting fact about or related to the AUA codon?


There's always an exception. See this great paper. Some eukaryotes don't use this as a stop codon. -Mike

Soon to be graduate student here:

Does your group have any plans to elucidate a "histone code" when looking at regulatory elements? I imagine you mostly wish to understand cis-regulatory enhancer regions and regions that encode non coding RNAs, but do you have any interest in adding an epigenetic component?


ENCODE, Common Fund REMC, and others have integrated data from different marks using software such as ChromHMM to learn what the most common histone modification patterns are. These patterns have also been correlated with other data (gene expression, open chromatin) and with the large number of pioneering studies from individual labs in order to interpret them. -Mike

do you guys need interns for summer 2017? i'm a junior studying genomics and molecular genetics at michigan state university.


There are opportunities for summer internships at NIH, and more information can be found on NHGRI's training page. Also see Elise's response to username "NotAProgramAnalyst" for information about our Program Analyst program at NHGRI for you to consider after you graduate. Good luck! -Team


This article and its reviews are distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and redistribution in any medium, provided that the original author and source are credited.