Hi Reddit, we’re a group of scientists and engineers from the Chan Zuckerberg Initiative and we’re helping to build a Human Cell Atlas. Ask Us Anything!


Hey Reddit! We’re a group of scientists and engineers from the Chan Zuckerberg Initiative -- a philanthropic organization founded by Mark Zuckerberg and Priscilla Chan. We’re working to help cure, prevent, or manage all diseases by the end of the century. One of the ways we’re doing that is by helping to build a Human Cell Atlas -- a world-wide effort to map all of the cells in the human body -- think the human genome project, but for cells (of which there are 30 trillion) rather than genes (of which there are 20,000 or so). Our big-picture goal is to support a fully open project in which scientists can share their knowledge to assemble a parts list of the cells in the healthy human body, and we’re looking for people who are interested in collaborating to develop new computational tools in support of this effort.

We’d love to talk to you about this and anything else related to our work on the Human Cell Atlas. Here is a photo of the team. We’ll be back between 10am and 12pm PT to answer your questions -- ask us anything!

Cori Bargmann, PhD -- Torsten N. Wiesel Professor and head of the Lulu and Anthony Wang Laboratory of Neural Circuits and Behavior at the Rockefeller University in New York. President of Science at CZI.

Jeremy Freeman, PhD -- Neuroscientist, and Manager of Computational Biology at CZI

Deep Ganguli, PhD -- Computational Biologist

Katja Brose, PhD -- Neuroscientist, Science Program Officer

Bruce Martin -- Director of Engineering

Andrey Kislyuk, PhD -- Software Engineer

(PS -- If you want to learn more about the Human Cell Atlas, check out this recent podcast from JAMA.)

EDIT -- Hey folks, we’re signing off for now, but will check back now and again to answer additional questions. Thanks to everyone who participated!

A huge blind spot of some of the prior efforts to run high-throughput sequencing (e.g. ENCODE, Roadmap) was that they only used abundantly available cell types. This is great for a first pass and gave information about all sorts of cell types from muscle to blood to bulk neuronal tissue, but it's not so good when you're looking at functional cells that are incredibly sparse in the body, or for cells that are incredibly fragile and can be impacted by experimental handling.

The best example of this problem may be cells of the inner ear, which are completely missing from ENCODE/Roadmap data sets, and which require a lot of unique expertise even to access and harvest the cells for study. Isolating cells in other rare niches and microenvironments can be comparably challenging. Furthermore, we have good reason to expect that there are still undetected surprises in the body.

How is your initiative preparing to handle the technical challenges of catching cells that grow in rare and inaccessible microenvironments in the body? What are you doing to enable discovery of cell types and niches that may have been overlooked in the scientific literature to date?


You are exactly right that some cells are easier to find than others, but that doesn’t mean they are more important! The inner hair cells in the ear are absolutely essential for hearing, but there are only about 20,000 of them per person -- about 1 in a billion cells in the body. So it’s critical for the Human Cell Atlas to find the rare cell types. This is a lot easier than it used to be, as biology has moved from bulk tissue analysis (like ENCODE) to single cell analysis. The strategy for HCA is to take each organ or each tissue within an organ, do a shallow sweep through many cells using methods like single-cell RNAseq, identify the categories, and then make sure that the categories are all represented as you go deep (for example, by sorting for rare cells to enrich them). The art will be shifting between shallow and deep analysis at the right times. (CB)
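
To put rough numbers on why enrichment matters, here is a minimal back-of-the-envelope sketch. It uses only the two figures quoted in this thread (about 20,000 inner hair cells and about 30 trillion cells in total) and assumes cells are sampled uniformly at random, which real tissue dissociation does not guarantee. The 1000-fold enrichment factor is a hypothetical value chosen for illustration, not a figure from the HCA.

```python
import math

# Figures quoted in this AMA; everything else here is an illustrative assumption.
TOTAL_CELLS = 30e12        # roughly 30 trillion cells in the body
INNER_HAIR_CELLS = 20_000  # inner hair cells per person


def rare_fraction(rare_count: float, total: float) -> float:
    """Fraction of all cells that belong to the rare type."""
    return rare_count / total


def prob_at_least_one(fraction: float, n_sampled: int) -> float:
    """P(at least one rare cell among n uniformly sampled cells) = 1 - (1 - f)^n."""
    # log1p/expm1 keep precision when the fraction is tiny.
    return -math.expm1(n_sampled * math.log1p(-fraction))


def cells_needed(fraction: float, confidence: float = 0.95) -> int:
    """Smallest n such that P(at least one rare cell) >= confidence."""
    if not 0.0 < fraction < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("fraction and confidence must be strictly between 0 and 1")
    return math.ceil(math.log1p(-confidence) / math.log1p(-fraction))


if __name__ == "__main__":
    whole_body = rare_fraction(INNER_HAIR_CELLS, TOTAL_CELLS)
    print(f"whole-body frequency: about 1 in {1 / whole_body:,.0f} cells")
    print(f"cells to sample for a 95% chance of one hit: {cells_needed(whole_body):,}")

    # Hypothetical: dissecting the right tissue and sorting enriches 1000-fold.
    enriched = whole_body * 1000
    print(f"after 1000-fold enrichment: {cells_needed(enriched):,}")
```

The required sample size scales roughly as 1/frequency, so any enrichment step (choosing the right tissue, then sorting) cuts the sequencing effort by about the same factor.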

How different will it be from the existing Cell atlas created by The Human Protein Atlas?


The Human Protein Atlas is an incredible resource that asks how tissue abundance and localization vary for all human proteins. This huge effort is different from, but complementary to, generating a Human Cell Atlas. An important distinction is that HPA is focused on characterizing information about unique proteins whereas the HCA seeks to characterize many types of cellular variation, including transcriptional, epigenetic, functional, and proteomic variation that make cells unique. (DG)

Could you start by explaining a bit about what you mean by a "map of all the cells in the body"?

Let's say that you find 3 different types of cell in some organ, how do you decide what the cells are? By shape/structure/appearance or by genes that are expressed?

What I think you're doing is (a) identifying a cell according to some criteria, e.g. a fat cell, and then (b) looking at which genes are expressed in the fat cell in the heart vs. the subcutaneous fat cell. Is this correct? Did I miss any major parts of your project?

Thanks. :)


There are about 30 or 40 trillion cells in the human body, and they fall into different types (like you said, fat cells). The Human Cell Atlas is a project to find out what all the cell types are, the number of each, their location, their neighbors, and their molecular composition. Exactly what defines a cell type depends on the factors you mention, like the molecules that it expresses or has the potential to express. It’s like a human genome project, but for cells. The goal of CZI science is to support basic science and technology that will allow us to cure, prevent, or manage all diseases by the end of the century. The Human Cell Atlas is a perfect example of that kind of project. All human diseases have a cellular basis -- either some cell is not doing what it’s supposed to, or it’s doing something that it’s not supposed to do. The goal of the HCA is to provide a foundation for understanding the cells in healthy human bodies, and then that can be used by every scientist and physician in the world to help study every disease. Scientists have been working on this project since cells were first discovered! But new technologies mean we can go much further. It’s happening now in lots of places, but it’s scattered. The new element is to pull everyone together to make this project scalable and sharable, so there’s a Human Cell Atlas for everyone to use. [CB]

My biggest question really is, why do you think this will be useful? We have had a fully mapped human genome for more than a decade and we still do not really have any idea what most of it does. Are we even ready or capable of using this kind of information in a worthwhile manner?

Further I am aware that we have a similar atlas for C. elegans or at least, we know exactly how every cell will progress from the first cell to terminal differentiation. Do we have evidence humans are that consistent? I imagine the variability in humans would be several fold larger, how would you go about classifying "normal" cells?

Finally, how are we defining healthy? There are thousands of mutations with subtle or currently undetectable phenotypes which could cause cellular or sub-cellular changes, and some of these are widespread. Do you plan to use SNP frequencies and other genomic markers to help normalize what is considered a normal cell?


Science takes time, and that’s why our goal is to support projects that will have an impact by the end of the century. We’re pretty confident that a knowledge of cells will advance medicine over a long timeframe. All human diseases have a cellular basis -- either some cell is not doing what it’s supposed to, or it’s doing something that it’s not supposed to do. The goal of the HCA is to provide a foundation for understanding the cells in healthy human bodies, and then that can be used by every scientist and physician in the world to help study every disease. A great example of why we think this will work is that current advances in immuno-oncology grew directly out of Jim Allison’s basic science research that identified new cell types in the immune system.

For humans, you’re right that understanding variability between individuals is absolutely key. Cell numbers will vary, and as you point out the genetic features won’t be exactly the same in different people either. There will be a normal range around all of these numbers, because normal humans are different. This is part of the science and the challenge of human biology, and we’re psyched to go at it. (CB)

Once this research is concluded will it be available for public use or will it be proprietary to various companies?


Yes, definitely available! We and the entire team working on the project are committed to making all of the data in the reference atlas freely and openly available for public use. And when CZI funds experimental science we’ll require this from grantees. We are following in the footsteps of other large scientific projects that made data openly available -- and accelerated science in the process. (JF)

Your team appears to be missing some key expertise: anatomists, physiologists, molecular and cell biologists, developmental and reproductive biologists, immunologists, etc. Are there any plans to bring people with these backgrounds onboard your team?


We absolutely need the contributions of experts from many fields in the Human Cell Atlas. Not just the ones you mention, but also physicians, and expert pathologists, who have some of the deepest insights into human biology. Rather than doing this project in-house, our goal is to support the scientists and physicians who are doing this work, wherever they are. Our requests for grant applications are open to scientists everywhere in the world, and we encourage all interested parties to get involved. In-house at CZI, we’ll be focusing on software engineering for shared data platforms and tool development, and on computational and analytical approaches that complement the research community, so that’s where our own team will grow in the near future. (CB)

How many cells have you already mapped at this time?

Also, How long does it usually take to process and map a single cell?


The project is just getting started, and is focused on technology and method development. Typically, many cells are processed together as a single batch -- today, that’s thousands of cells at a time, but the batch size is growing quickly. Processing a cell usually includes cell sample collection, preparation, sequencing and primary data analysis -- the experimental parts usually take around a week, and the computational part per cell is less than an hour. But the subsequent analysis and interpretation of the entire dataset can take months or years, and it is expected that the Human Cell Atlas data will have ongoing utility for decades. (BKM)

Will this Atlas be open and available to all scientists for research purposes?

Edit (for clarification and expansion): Will the data be open to everyone (scientist or not)? And are there barriers (such as political or government interference) that may prohibit scientists from being able to access and use the data?


Definitely open! We and the entire team working on the project are committed to making all of the data for the reference atlas freely, openly, and widely available to everyone, without any barriers. We’re following in the footsteps of other big science projects where data were open and it accelerated discovery. (JF)

What other applications of your work could be mapped on to or integrated in to Facebook's own initiatives to grow and develop their business? Or is your project 100% autonomous from Facebook's business goals? Thank you.


Yes, this project, and the entire CZI organization, is 100% independent from Facebook. Thanks for asking! (JF)



This is absolutely not about helping organizations increase profits :) This is all about working with the scientific community to make the biggest impact we can, and doing so in an open way. Our LLC structure allows us to do that with the greatest amount of flexibility. As an LLC, we can do things like fund experimental scientists (like the Human Cell Atlas), fund nonprofit services for science (like bioRxiv), make acquisitions that we can then make freely available to scientists and the public (like Meta), and participate in advocacy work. [Team]

Will the map at first only be for adults, or will you try to map out the migration of cells from the time of embryogenesis? How will you collect the information, e.g. autopsies, data aggregation from previously performed research? How/when will the public and scientific community at large get to have a look, at the conclusion of the project, or will we see interim findings?

That's a really cool project. Good luck!


Thanks for your encouragement! First, you’re getting right to the heart of why it’s so challenging to work on the Human Cell Atlas, or any part of human biology. Even a first draft will need to recognize that there are differences between people, and there won’t be one atlas that describes everyone. So this will happen in stages, where there are early drafts and increasing information over time. The Human Genome Project has gone through a similar evolution, from one “consensus genome” draft to the “1000 genomes project” and onward.

That leads into your next question about where the human tissues will come from. The goal is to have data from many different sources in the Human Cell Atlas. That could include previously-generated data, and new tissue biopsies, and data from donors who agree to donate tissues after their death (who are unsung heroes of science).

Finally, we strongly support a reference Human Cell Atlas that is completely open to the public and scientific community, with data added in real time at intermediate stages. This will only be possible with tissue donors who consent to make their data available, under proper legal and ethical guidelines. (CB)

What makes you think that this is even possible? As far as we/I know this has only been done for C. elegans and is possible because of strict cell lineage and rigid positioning of cells. What evidence has been published for humans?

Also, how will you do it? You can't experiment as worm researchers did. Is this "Atlas" a misnomer?


For humans, you are 100% right, we are not going to map out all 30 or 40 trillion cells one by one. Furthermore, that number probably isn’t even exactly the same for you and me, unlike worms. So what we want to do is to find out how many different cell types there are in the human body (which is probably going to be pretty much the same for everyone, except for differences in reproductive organs in men and women), and how many of each, and where they are in tissues, and what their neighbors are, and what their molecular composition is. There will be a normal range around all of these numbers, because normal humans are different. We will be supporting researchers who will be doing the actual experimental work in the Atlas, not doing it ourselves. (CB)

With 30 trillion to choose from, what cells are your first priorities, after getting the technology to work?


Human Cell Atlas is a large global consortium of researchers with priorities being set by the scientific community. CZI is supporting these efforts but not setting the priorities. The fields of immunology, neuroscience, and developmental biology developed many of the important tools and technologies that make HCA conceivable. However, a reference atlas of human cells will be important for understanding the cellular causes of disease - an interesting application is understanding how the immune system shifts from a normal state in response to neoplasia.

Wow, this project seems ambitious. As a (very junior) computational systems biologist, my immediate question is could you see your cell atlas being incorporated into tissue- and organ-level models? And would there be any difficulties in this kind of extrapolation with cross-compatibility, given that the cell models will be generated by multiple research groups?


Great question! One of the biggest computational challenges in the project is integrating data from different modalities and different research groups. There’s new work needed both on algorithms for integrating and normalizing different datasets, and visualizing datasets that span different spatial scales. The RFA we just launched (https://chanzuckerberg.com/initiatives/rfa) will support groups helping to build solutions to exactly these problems! (JF)
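
As a small illustration of the "normalizing different datasets" part, here is a minimal sketch of one common first step: restricting to the genes that all groups measured, then scaling each cell to a fixed total and log-transforming, so that cells sequenced at different depths become comparable. The lab names, gene names, and counts are made up, and this is an editorial example rather than the HCA pipeline. It does not attempt batch-effect correction or cross-modality integration, which is where the hard algorithmic work lies.

```python
import math
from typing import Dict, List, Tuple

Cell = Dict[str, int]  # gene name -> raw read or UMI count (hypothetical format)


def shared_genes(datasets: Dict[str, List[Cell]]) -> List[str]:
    """Genes observed in every dataset, in sorted order."""
    gene_sets = []
    for cells in datasets.values():
        genes = set()
        for cell in cells:
            genes.update(cell)
        gene_sets.append(genes)
    return sorted(set.intersection(*gene_sets)) if gene_sets else []


def normalize(cell: Cell, genes: List[str], scale: float = 10_000.0) -> List[float]:
    """Scale counts to a fixed per-cell total, then apply log(1 + x)."""
    counts = [cell.get(g, 0) for g in genes]
    total = sum(counts)
    if total == 0:
        return [0.0] * len(genes)
    return [math.log1p(c * scale / total) for c in counts]


def integrate(datasets: Dict[str, List[Cell]]) -> Tuple[List[str], List[Tuple[str, List[float]]]]:
    """Return the shared gene list and one (source, normalized vector) row per cell."""
    genes = shared_genes(datasets)
    rows = [(source, normalize(cell, genes))
            for source, cells in datasets.items()
            for cell in cells]
    return genes, rows


if __name__ == "__main__":
    # Two hypothetical labs: the same relative expression at 10x different depth.
    data = {
        "lab_a": [{"GENE1": 10, "GENE2": 30}],
        "lab_b": [{"GENE1": 100, "GENE2": 300, "GENE3": 5}],
    }
    genes, rows = integrate(data)
    print(genes)
    for source, vector in rows:
        print(source, [round(v, 3) for v in vector])
```

In this toy input the two cells end up with identical normalized vectors, because they differ only in sequencing depth and in one gene that the other lab did not measure.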

I noticed there are a lot of neuroscientists in your group. Is the brain a particular target of your efforts?


The Human Cell Atlas is a project to study the whole body, not just the brain. We’re not doing our own experimental work at CZI, but supporting external scientists, and we expect that support to be quite broad. Our science team has experience in neuroscience, genetics, developmental biology, cell biology, and immunology, but more importantly we’re supporting work on every organ to build the atlas. (CB)

So define "map". Are you just sequencing everything? If so are you going to assemble it or leave it fragmented? Then transcriptomics? And proteomics? If you do proteomics how are you going to get around the lack of sensitivity for most techniques like mass spec? I guess I'm just generally curious about what your methods will be and what your standard is. I know some professors who would spend years on one cell because their standard is too high, so how do you make it realistic? Also are you guys using single cell sequencing techniques or just using cultures?

Also looking at some of the other comments, tons of people seem to be negative in their response. Sorry about that. I guess the general fear is data distribution and access rights. But hey, if your intentions are solid then I think what you guys are doing is great. Thanks!

Edit: had another question. What will you do to keep the project scalable and malleable? Honestly the biggest downfall I see when comp scientists approach biology is that they expect everything to be logical and follow some set of defined heuristics and laws that can be used to predict actions in novel situations. Which it does, but oftentimes this changes from cell to cell. The truth is that every time we assume we know how something works in biology, we find some circumstance that disproves this, and if we wrote code to predict an action we'd have to go back and change it. So how will you make it open to change in the future?


Great questions! The goal of the project is to measure as many cells as possible with robust and diverse techniques. Although we have started with single cell transcriptomics, because the technology is maturing rapidly and readily available, we are excited to support research into promising cutting edge methods. We hope to encourage gold standards in both methods and algorithm development through benchmark datasets and community agreement. To keep the project scalable and malleable, we are developing a data processing and analysis platform, entirely in the open, that is modular, and that anyone can contribute their code and ideas to with only light community governance. [DG]

Have you considered BOINC (Berkeley Open Infrastructure for Network Computing) as a resource to help with computational problems?


We have not considered BOINC, but it is an interesting idea and may help individual researchers compute on the data set. The core HCA project requires that data and analytical results be online and publicly accessible. We expect the total data set will be very large (tens of petabytes), creating a unique set of data access challenges. The HCA software team are focused on making data access (relatively) easy, and are building the compute and data storage infrastructure on the public clouds (AWS, GCP, etc). There will be a rich set of data access APIs and a default set of analytical results to accompany the primary measurement data. Our goal is to enable analysis compute to come to the data, and to support both small and large scale analysis. (BKM)

Is there any particular reason you want to focus on mapping all the possible cell types, rather than the proteome (all possible proteins)? It seems that knowing all the proteins is far more important in disease than mapping a niche ear cell.


The protein content of a single cell is one of many important aspects of what defines that cell as unique. An HCA will define cell types and the heterogeneity within the reference human at the single cell level - it is important to know about the proteins that make that niche ear cell unique and important for human physiology. The goal of the HCA is to define differences among cell types and understand not only the human proteome but how it can be employed by cells to create different functional states. Similar efforts, such as the Human Protein Atlas, are looking specifically at generating an atlas of the proteome and how it varies in different tissues and cells. An HCA will be an integrated atlas that hopefully will include the proteome information that you are looking for but also offers windows into how this is related to other key aspects of cell identity. [JC]

How is information from your study being released? And will new discoveries be withheld for monetary means?

I'm looking forward to this study because I believe this will assist in the understanding of not only how the body works, but also ways disease can be treated. My worry, similar to many on this AMA, is that information will be exploited for monetary gain, rather than being released world wide for global advancement.


The Human Cell Atlas project and CZI are committed to open science and data sharing. Data will be released freely and openly on an open data platform, accessible to all. New discoveries will definitely not be held back for commercial or monetary reasons!! (KB)

Why is it so difficult to identify cells? Why are we still discovering new ones? Just today I read about a new type of brain cell discovered: https://www.sciencedaily.com/releases/2017/08/170810141712.htm


Biologists have been working to characterize and classify cells into distinct types for over a century based on detailed descriptions of their properties, e.g., their location and relationship with other cells, biological function, and molecular components. All of these measurements have been driven by advancements in technology, starting with the light microscope and culminating in today's high throughput molecular profiling techniques. To begin to identify and discover new cells, we first need a way to capture all this data in a useful and open format that is amenable to integrative analysis and maintains pace with measurement technology development. Furthermore, we need to develop scalable algorithms that can effectively mine this data to identify cell types in a way that is robust to potential confounds. We will tackle these problems under the Human Cell Atlas project to help accelerate progress in identifying cell types! [DG]

Do you think it is ok to patent natural occurrences in nature??

Do you believe it is ok to deny people something that might cure them or help them because they don't have enough money??


We do not believe money should be a barrier to individual health. Our mission at CZI is to advance human potential and promote equal opportunity, and we’ve made science one of our priorities because it is an area where systemic barriers often prevent individuals from reaching their full potential.

Is this all about anatomy, or function too? Neuroscientists I know say we don't even understand the retina yet and it's vastly more accessible than any part of the brain. What kind of high-level understanding of cell function do you realistically hope to achieve? Will understanding anatomy alone help medicine?


Great question! The initial goals of the HCA are focused on building an anatomical atlas, defining the molecular and cellular signatures of cells across the body, but the ultimate goal of this atlas is as a foundational resource for understanding function. We can use this knowledge about molecular signatures to help us access those cells for functional studies. For instance, by knowing the genetic signature of a cell, we can use this information to ablate those cells and test what happens when a cell (type) is removed. Or we can use these signatures to genetically engineer the cell, to delete genes or misexpress genes in that cell. Down the road, for some diseases, this type of information could be useful for developing gene therapy treatment strategies. We can also use these signatures to express molecular tags that will let us visualize and track the cell type. We can then follow those cells during development and disease to understand how that cell changes. You mention the retina and this is a great example of a system where structure and function inform one another. There have already been some interesting studies in the retina where scientists have been able to remove specific cells (based on their molecular signatures) and show effects on functions such as direction selectivity, motion detection, etc. (KB)

What concrete steps can this project or Facebook take to ensure that such technology is equitably accessible to everyone who can benefit from it?


Equitability and accessibility are really important goals for the project (to be clear, CZI is independent of Facebook, and Facebook is not involved with this project). To ensure equitability, we are conducting all of our R&D in the open, and require that the results of research funded by our grants are also published openly. Our goal is to create a reference atlas of cells in the human body, akin to the product of the Human Genome Project and the ENCODE project after it. We intend to make sure this reference data is as freely available as data from those projects before us, for all users: academic, non-commercial, and commercial alike. Because of the amount of data involved, we will also ensure that both data and processing pipelines are openly available on multiple public clouds (to enable anyone to “bring compute to the data” - using object storage as a high-performance I/O backbone) as well as downloadable. On the upstream data generation side, covering molecular and wet lab assays, we are also working with commercial assay and instrument vendors to encourage them to release their methods and protocols as openly as possible. [AK]

How do you think CRISPR Cas9 will play into the modification of humanity within the next 100 years?


Absolutely, but perhaps not in the way that you are imagining. There are clear ethical implications for the use of CRISPR/Cas9 that will have to be discussed among the scientific community and public at large. However, as a research tool there is little doubt that CRISPR as a genome editing technology will accelerate biomedical research and help us better understand disease, evolution and other fundamental aspects of our world. As a research tool and potential therapeutic (to correct genetic disease) this technology is likely to help us understand and potentially “modify” the course of human health and disease over the next 100 years. [JC]

How will mapping all the cells in a human body help you achieve your end goal of eradicating all the diseases by the end of this century?


We want to make every scientist a better scientist. There are many human diseases for which we don’t yet know the cause or the mechanism. We believe that by developing a true understanding of causes and/or mechanisms it will be faster and more direct to develop cures and therapies. [KM]

Our big-picture goal is to support a fully open project

How can we access the project?


The HCA is currently in the technology development stage, with general information available at www.humancellatlas.org. As data is generated, it will be available from that URL. The data management platform is under active development at github.com/HumanCellAtlas/. All code, design docs and Slack channels are public and we would welcome participation. You can find links in the GitHub repo README. (BKM)

I'm a Science Teacher here in the UK. I already have Rift + Touch (a Virtual Reality headset - from Facebook/Oculus by the way!). My question is I would love to teach my pupils about cells using VR. Is this something that you'd be able to envisage your project being able to produce - a cell learning tool in VR ?


That sounds super cool! We’ve recently seen a couple demos using 3D and VR to visualize cells, and heard from scientists who noticed something about their data by seeing in VR that they hadn’t seen otherwise. These tools aren’t quite ready yet for teaching purposes, but it’s an exciting area and we hope this project can support it. (JF)

What kind of scientific backgrounds do your team bring to the table? Is your team based strictly in life sciences or do you plan to utilize other sciences like data science as you move forward?


Great question! We’re a team with a range of backgrounds. The CZI Human Cell Atlas Team currently consists of a combination of biologists, computational scientists, and engineers. At CZI, we hope to have a differentiated impact by supporting more engineering, data science, and quantitative approaches, both by our own work and by funding external scientists and engineers. Our new RFA is specifically designed to support computational tools, algorithms, visualizations, and benchmark datasets for the Human Cell Atlas. If you have ideas, apply! https://chanzuckerberg.com/initiatives/rfa/ (CB)

I have several technical/operational questions regarding the CZI.

1.) Multiple papers have documented the issues with reproducing bioinformatic workflows from methodology descriptions in previously published studies or in online data repositories (for example, use of different parameters with NGS aligners like TopHat/Star/BWA). Are you considering implementing standardized workflows in order to ensure exact reproducibility of data analysis (such as a repository of Docker images documenting the exact analytical workflow)?

2.) What sort of sequencing chemistry are you planning on using for nucleic acid-centered portions of the initiative (Illumina HiSeq SBS/Oxford nanopore/etc)? Will you be enforcing the use of one platform to avoid inter-platform variation/biases? A lot of the studies cited in your bioRxiv white paper use relatively short HiSeq reads. What is your strategy for dealing with ambiguously mapping / multi-mapping reads (such as those from the human Polyubiquitin-C gene)? Will you be using some kind of expectation-maximization algorithm like RSEM?

3.) What is the general strategy for incorporating emerging -omics techniques? Your bioRxiv paper mentions examining RNASeq data to characterize gene expression in different cell lineages. However, recent literature suggests that there is a high degree of translational heterogeneity for similar transcripts in different cellular environments. Will the CZI also be incorporating ribosome profiling (RiboSeq) into its operational plan in order to account for this?

4.) It would seem that in high-throughput biology, there is a significant benefit to standardized workflows and automated workstreams (a la the Broad Institute's core facilities). Is the CZI planning on setting up any shared core facilities in the future for production of proteomic/transcriptomic/genomic/etc. datasets?

Have a great day!


  1. We are working with the GA4GH, Broad Institute, and the Common Workflow Language group to develop better ways to enable portable, reproducible scientific workflows. Our data coordination platform has this as a top priority. We are currently considering using technologies like Dockstore, CWL, WDL, AWS Batch, Google Genomics Pipelines, etc. (which all utilize Docker in their stacks) to enable this goal.
  2. CZI itself doesn’t run sequencing assays, but HCA partner and funded labs do. Aside from using RSEM on HiSeq reads (which some of our pipelines definitely do!), there will probably be sequencing datasets coming from Oxford Nanopore and various other technologies. We focus on selecting and funding the most promising efforts using these new technologies, including sample prep, sequencing, normalization, and denoising projects and algorithms. Some of our funded projects are specifically for comparing between them.
  3. We’re always looking at new experimental methods and assays, and expect that RiboSeq and other innovative assays will figure prominently in our funded projects.
  4. We agree wholeheartedly that doing science is easier and more powerful when you have large core facilities with standardized protocols, lots of automation and sophisticated monitoring. At the same time we must balance this against equitability, diversity and sourcing of scientific ideas from the wider community. Striking the right balance and working as a community toward gold standard pipelines and best practice protocols is a big priority for us. We won’t be starting major core facilities of our own, but we will be encouraging distributed reproducible science through agreement on these best practices and through technology sharing among project partners. (AK)
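
To make the multi-mapping question concrete, here is a minimal sketch of the expectation-maximization idea behind tools like RSEM. It is a toy illustration, not RSEM's actual implementation: it ignores transcript length, read quality, and fragment-size models, and the function name `em_abundance` and the read data are hypothetical.

```python
from collections import defaultdict


def em_abundance(reads, transcripts, n_iter=50):
    """Estimate relative transcript abundances when some reads map to
    several transcripts. `reads` is a list of tuples, each naming the
    transcripts that read is compatible with (toy model, assumption)."""
    # Start from a uniform guess.
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        counts = defaultdict(float)
        # E-step: split each read across its compatible transcripts in
        # proportion to the current abundance estimates.
        for compatible in reads:
            total = sum(theta[t] for t in compatible)
            for t in compatible:
                counts[t] += theta[t] / total
        # M-step: re-estimate abundances from the fractional counts.
        n = sum(counts.values())
        theta = {t: counts[t] / n for t in transcripts}
    return theta


# Hypothetical toy data: 6 reads map uniquely to A, 2 uniquely to B,
# and 4 reads map ambiguously to both A and B.
reads = [("A",)] * 6 + [("B",)] * 2 + [("A", "B")] * 4
theta = em_abundance(reads, ["A", "B"])
print(theta)
```

In this toy case the ambiguous reads end up shared roughly 3:1 in favor of A, mirroring the ratio of uniquely mapping reads, rather than being discarded or split evenly.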

This AMA is being permanently archived by The Winnower, a publishing platform that offers traditional scholarly publishing tools to traditional and non-traditional scholarly outputs—because scholarly communication doesn’t just happen in journals.

To cite this AMA please use: https://doi.org/10.15200/winn.150271.11474

You can learn more and start contributing at authorea.com


Thanks! We absolutely agree that not all scholarly communication happens in journals, and in fact, CZI has been actively investing in efforts to accelerate the pace of knowledge dissemination. We’re supporting bioRxiv, a preprint platform for biologists, for just this reason. (KB)

For Jeremy Freeman (and others too if they want to chime in): you've been working on open and reproducible science software for the past few years. What are your biggest takeaways from that work? To what extent do you think the challenges are technical, rather than social/political/bureaucratic?

What's next after scientists everywhere have become Software Carpenters writing Jupyter notebooks and sharing datasets on Dat?


Hey there! Such a good question. My main takeaway is just how hard it is to find the right abstractions when building software for science. In so many areas, there are lots of existing open-source tools that attempt to solve a similar problem but, collectively, don't quite nail it. This isn't because the tools are bad, it's because it's so hard to find the general solutions. This is a technical challenge, in that it's hard to identify common patterns and abstractions that satisfy all needs, can be broken out into reusable modules, and don't require endless customization specific to each lab. But it's also a social challenge, in that it requires coordination and collaboration, understanding one another's needs, and creating incentives to build shared solutions rather than go it alone. It'll be great when more scientists learn modern tools, but to really solve these problems, we'll need to work together as a community, both within and across labs. (JF)

What about differences between cell populations in the body in humans of different ages? Could it help in the study of aging? (Could perhaps attract funding?)


Great question! Yes indeed, we already know of some of the changes with age and this project will map many more. As we get a fuller picture of the ageing process, we will develop a deeper understanding and then we will be able to invent better treatments and therapies. KM

So something I never really understood about the human genome project that relates to your work is that according to my understanding every human cell has its own genome that's unique to that type of cell. If this is correct then doesn't that mean that to truly map the entire human genome one would need to map the DNA of every single type of human cell including ones that may only exist during specific developmental stages of human growth?


Actually, that’s not quite right. Almost all the cells in our bodies have exactly the same DNA sequence. The most notable exception is in the immune system, where we edit out antibody genes. But that is one of the rare exceptions. Much more common is the fact that our brain cells have the same DNA sequence as our muscle cells. In summary: no, we don’t need to sequence the genomic DNA over again for each cell type. [KM]

Is this project focused on compiling the transcriptomes of different cell types? Or the methylation status of different genes? How will cell types be differentiated (by traditional cell marker genes or phenotypically)? If two cells of a single type have variability, will you include that information or simply the average?

Furthermore, will your team actually explore the data or simply compile and publish them?


All of the above! We’ll need to integrate data from many sources to fully understand cell types. This multimodal integration, and the question you raised about variability, are some of the reasons that the Human Cell Atlas project needs new computational approaches, algorithms, and visualization tools, which are the subject of our current Request for Applications. https://chanzuckerberg.com/initiatives/rfa (CB)

do you, as scientists, have any ethical misgivings about what you are doing? I get that a lot of people in this thread are really fired up about this. The most extreme views probably aren't completely justified, but do you, as individuals and scientists feel that there is anything that is ethically ambiguous or uncomfortable about what you are doing?

just curious to hear what your internal monologue is like when this kind of thing comes up.


We do think about ethical considerations. Our first step is for this work to be aimed at reducing human suffering by curing or at least mitigating disease. We are also taking steps to make sure that all of our donors have given their informed consent to make their data available, under proper legal and ethical guidelines. Finally, we are making all the data publicly available, so that any one anywhere in the world can make use of the information. [KM]

Hi, I'm a soon-to-be software engineering graduate. Would you be able to describe what this part of the team is and what the software is doing to help?


The HCA algorithms and software engineering teams span many labs and genome institutes, so working in a collaborative multi-team environment is a must. We do tasks ranging from “productionizing” algorithms for data processing into reproducible workflows, to engineering back-end systems to run those workflows on large fleets of instances and object storage buckets in the cloud, to devops (Development/Operations) of these systems and our development workflow, working with product managers and UI/UX specialists to gather feedback from scientists and make sure the software works for their experimental workflow, and so on. It’s a multidisciplinary software engineering team with people from many backgrounds involved. We use GitHub a lot - check out https://github.com/HumanCellAtlas.

As an aspiring computational biochemist, what programming languages do Drs. Ganguli and Freeman recommend being proficient in? I've heard recommendations from Python to R to Fortran.


Great question! We recommend being proficient in both Python and R for different reasons. Python is great for scientific programming and data analysis that can run in a production environment at scale. R is phenomenal for scripting models and data visualization. It also comes with some great libraries for bioinformatics! We also recommend learning some JavaScript -- the web is increasingly important for things like data sharing and interactive data visualization, which will be super important for this project! (DG, JF)

I'm a post B.S. Individual, I studied Molecular&Cellular biology and General Chemistry as a double Major. (If you are hiring).

By Cell Atlas, do you mean to also plan to incorporate interactive & integrative technologies to visualize a standard cell type and its components, as well as the biochemical pathways, molecular trafficking, and secondary and small molecule signaling?

What are your biggest hurdles in this endeavor and what's the long term roadmap?

True visual roadmaps of the cell benefit undergrad Molecular Bio students. Many times the interactions they learn have small interactive connections and tell the story piece by piece. Being able to see more of the whole story (some of which we already know) would aid in not only their education but also research aspirations.


We’re super interested in new ways to visualize and integrate different information about a cell. That’s a theme of the RFA we just launched to support collaborative computational tools for the cell atlas (https://chanzuckerberg.com/initiatives/rfa/), and our collaborators at the Allen Institute for Cell Science are doing really nice work in this area (http://www.allencell.org/). The biggest challenge is probably integrating the enormous variety of both genomic and imaging data that we expect this project to generate, and the fact that so much of the data are complex and probabilistic rather than simple and discrete. (JF)

PS we are hiring on both our Science and Technology teams! Check out https://chanzuckerberg.com/careers

That's really commendable; but every cell is controlled via DNA. Should we not be looking more into that area? Flip a switch and a heart cell is now a liver cell. Well... doesn't just change into one. Just making a point here.


Well, most cells in the body (except some blood cells) have the exact same DNA, with different parts of that DNA transcribed into RNA in different cells. But maybe what you’re getting at is that some DNA is “marked” by methylation that is specific to that cell type. This marking is part of the biological process called epigenetics, and it’s one of the molecular signatures of cell type that will be part of the Human Cell Atlas. (KB)

Hey, thanks for the AMA!

How has machine learning contributed in this project? And what are the future expectations of the field in this research?


The human cell atlas community employs many machine learning algorithms for data mining, clustering, normalization, and visualization. We will ensure that models/methods that adhere to community determined best practices are available to run on the data at scale. Additionally, we hope to encourage the development of new machine learning algorithms by hosting hackathons and having benchmark datasets available for algorithm developers to test their ideas on [DG].

Super exciting initiative, thanks for being here!

This study found that six cell types comprised 97% of cells in the human body: red blood cells (70%), glial cells (8%), endothelial cells (7%), dermal fibroblasts (5%), platelets (4%), and bone marrow cells (2%). On the other hand, about half of all cells are from the microbiome.

Will your effort also map the microbiome? While I imagine that your classification is much more precise than the gross categories from the first paper, it seems that there is significantly more heterogeneity in foreign cells than human cells. This is implicated in all sorts of disease states--might these be missed if only human cells are mapped?

How do you define a cell type--some combination of physiology measurements + gene expression? Is there a reason to believe that cells can be reliably classified into one of 30 trillion labels? How, epistemologically, can we justify cell types when e.g. stem cells might be classified differently depending on when they are sampled during their lifetime?

Many thanks!


This is a fascinating series of questions - we will try our best to touch on a few key aspects! One of the important findings of the extensive, and ever growing, work on the microbiome is how important the microbiome is for human physiology. As you point out, the microbiome is a huge component of physiology and will almost certainly be a key regulator of key cell types - for example, many cells in the gut are responsive to surrounding microbiota. Even if the microbiome is not measured directly by the HCA, the effects of variation among individuals and populations will need to be accounted for. Characterizing individual cells will help us learn about important areas of variation and drive further experimentation about root causes. The second question about sampling and age will be very interesting to watch. There is no doubt that organs and their resident cells age. Also, not only will individual cells change but the ratio of cells and tissue architecture will change with age! Very complex but interesting questions. This area probably comes down to beginning to understand a baseline from which we can better understand the variation! (JC)

Are there any differences in what cells exist across different human beings, or do you expect that everything is standardized?

Also, same question, but for species? How much difference is there in the cells found in humans vs the cells found in primates?


These are both great questions. Variability across individuals is an important issue and one that the project is going to look at. Likewise, we’ll also learn a lot from looking at some of these issues across species. We don’t yet know very much about differences between humans and non-human primates, but there is ongoing research looking at these issues, for instance, some work looking at differences between human and non-human primates in the brain. (KB)

What does your project timeline look like?


A first-draft HCA could be ready in 5 years. A more comprehensive atlas could be available in a decade. (AB)

Thanks for the AMA.

My question: if as a by-product of your research you found a cure for aging would you make it available to everyone?


We would indeed make it available to all people. We are committed to open science for the good of all [KM]

What defines a cell type, is there a clear distinction between them or is it sort of muddy?

How useful would such an atlas be when it's completed?

What are the most exciting research frontiers in biology?


Great question about cell types! A precise definition of cell type is still in progress. Personally, I think of cell types as clusters in a ‘feature space’ where the features can include anything from morphological to transcriptional information. Often, it’s unclear how distinct these clusters will be, and how much the clustering depends on the data considered versus the clustering algorithm and data pre-processing steps employed. (DG)
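
As a rough illustration of that last point, the sketch below clusters the same four hypothetical "cells" twice with plain k-means: once on raw features and once after min-max scaling. The feature values, the choice of k-means, and the scaling step are all assumptions made for the example, not a description of any HCA pipeline.

```python
def kmeans(points, centroids, n_iter=10):
    """Plain k-means (Lloyd's algorithm); returns one cluster label per point."""
    labels = []
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((x - c) ** 2 for x, c in zip(p, centroids[k])))
            for p in points
        ]
        # Move each centroid to the mean of its assigned points.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[k])
        centroids = new_centroids
    return labels


def min_max_scale(points):
    """Rescale every feature (column) to the range [0, 1]."""
    cols = list(zip(*points))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [tuple((x - l) / (h - l) for x, l, h in zip(p, lo, hi))
            for p in points]


# Hypothetical cells: (normalized expression of one gene, cell area in
# arbitrary units). The two features live on very different scales.
cells = [(0.0, 0.0), (0.1, 60.0), (0.9, 40.0), (1.0, 100.0)]

raw_labels = kmeans(cells, [cells[0], cells[3]])
scaled = min_max_scale(cells)
scaled_labels = kmeans(scaled, [scaled[0], scaled[3]])

print(raw_labels)     # grouping driven by the large-scale feature
print(scaled_labels)  # grouping driven by the expression feature
```

On the raw data the second feature dominates the distances, so the cells pair up by area; after scaling, the same cells pair up by expression. Neither grouping is "the" answer, which is why the pre-processing choices matter.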

My question is for Deep Ganguli & Andrey Kislyuk. What are the specific roles of a computational biologist on a team such as this, and how does it differ from a software engineer? One uses the tools while the other makes them? Forgive the question for seeming a bit ignorant.

I'm currently a Cell & Molecular undergraduate with programming skills and would love to incorporate those skills into my career.


We have a number of software engineers and computational biologists on the team, and we definitely work together closely on the project. Software engineers focus more on building and deploying services that will process data "in production" - after a molecular assay and pipeline of algorithms to analyze its output has been agreed upon, we build services that upload, process, and distribute data to the public reliably and securely. Computational biologists focus more on staying close to the edge of knowledge and technology. They collaborate with external scientists and build prototype code to deeply understand their novel data pipelines and analysis problems, with an eye towards how to concentrate software engineering efforts in the future [AK, DG]

A few questions from interested collaborators:

  1. Can a proposal involve both creating new computational tools and testing them by generating our own cell data, or would CZI provide the data to be analyzed?

  2. If CZI is providing Cell Atlas data, then what kinds of data will be provided?

  3. Can Key Personnel be collaborators who are PIs of their own groups?

Thanks for your time!


Yes- key personnel can be collaborators even if they are PIs of their own group. However, if you consider your collaborator a PI for this particular project, they should submit their own application. Please see more details in our FAQs about collaborations for the RFA: https://chanzuckerberg.com/initiatives/rfa/faq

Hi Jeremy - Do you have any plans on organizing events similar to the CodeNeuro hackathons, or competitions like SpikeFinder and Neurofinder? It seems like the Human Cell Atlas will have very rich data that could benefit from a variety of people trying to tackle the analysis together. (Additionally, are you still supporting the CodeNeuro events, or are they going to end?)


Hey there! Yes, we’re definitely trying to take useful learnings from those projects and apply them here. Websites for comparing algorithms on benchmark data sets, and hackathons for collaborative development, are very much in the works for the Human Cell Atlas. And stay tuned for news on CodeNeuro :) Although neuroscience is not the focus of CZI, it’s an area and community I’m personally committed to and eager to support. (JF)

Isn't this kind of Sisyphean though, because cells are dynamic objects that live on a continuous spectrum of differentiation, potentiation, licensing, activation, senescence, etc.?


It’s definitely a challenge :) Your question identifies one of the biggest problems, which is the continuous, probabilistic nature of the data. Cells do fall into different categories, but also, exactly as you say, live along a continuous spectrum of state. We need high-throughput measurements that can span this range, new machine learning approaches that can simultaneously capture both aspects of the data, and new interactive visualizations to help us make sense of the complexity. This is exactly the kind of work we hope to support with the RFA we just launched (https://chanzuckerberg.com/initiatives/rfa/) -- want to come help us solve the Sisyphean task? :) (JF)
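
One simple way to picture "both aspects at once" is soft assignment: each cell gets a probability of belonging to each category instead of a single hard label. The sketch below uses a Gaussian kernel around hypothetical category centers; the centers, the names, and the `scale` parameter are assumptions for illustration, not a method the HCA has committed to.

```python
import math


def soft_membership(cell, centroids, scale=1.0):
    """Return the probability that `cell` belongs to each named centroid,
    using a Gaussian kernel on squared Euclidean distance (an assumption)."""
    weights = {
        name: math.exp(
            -sum((x - c) ** 2 for x, c in zip(cell, center)) / (2 * scale ** 2))
        for name, center in centroids.items()
    }
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical category centers in a 2-D feature space.
centroids = {"progenitor": (0.0, 0.0), "differentiated": (2.0, 0.0)}

# A cell halfway along the trajectory vs. one sitting on a category center.
midway = soft_membership((1.0, 0.0), centroids)
committed = soft_membership((2.0, 0.0), centroids)

print(midway)
print(committed)
```

The midway cell is split evenly between the two categories, while the committed cell is assigned mostly, but not entirely, to one. The categories still exist, but the continuum between them is kept rather than thrown away.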

This article and its reviews are distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and redistribution in any medium, provided that the original author and source are credited.