The Telomere-to-Telomere CHM13 Assembly of a Human Genome
Until recently, GRCh38 was the most accurate map of the human genome, and nearly all of human genetics relies on it. It is the bedrock for finding disease-causing variants, choosing targeted treatments for cancer and studying how we evolved from our ancestors. But it is a flawed map. The technology used to generate it could only read short fragments of DNA, much like building a puzzle from tiny pieces. Nurk et al. (2022) have started to fill in the gaps with the release of T2T-CHM13, a more complete and accurate map of the human genome that introduces 200 million new DNA bases and fixes errors in the current reference [1].
An individual's DNA is usually analysed by comparing it to the reference genome, so improving the reference means that your DNA can be mapped correctly. The study's authors reanalysed samples from the 1000 Genomes Project and found millions of new variants that had been missed because of gaps and errors in GRCh38 (see Figure 5B and [2]).
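To make the idea concrete, here is a toy sketch of why a gap in the reference hides variants. It is not how a real aligner or variant caller works: the sequences, the function names (`best_hit`, `call_variants`) and the one-mismatch threshold are all made up for illustration. A read is placed at the best ungapped position in the reference, and any base that differs from the reference is reported as a variant.

```python
def best_hit(read, reference, max_mismatches=1):
    """Return (position, mismatches) of the best ungapped placement, or None."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        if "N" in window:
            # Unresolved reference bases ("N") give the read nothing to align to.
            continue
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best


def call_variants(read, reference):
    """Return a list of (position, ref_base, read_base), or None if unmapped."""
    hit = best_hit(read, reference)
    if hit is None:
        return None
    pos, _ = hit
    return [
        (pos + i, reference[pos + i], base)
        for i, base in enumerate(read)
        if base != reference[pos + i]
    ]


# Hypothetical toy sequences, not real genomic data.
complete_ref = "ACGTTGCAGGCTAACCGTATCG"
# The same region with eight unresolved bases standing in for a gap.
gapped_ref = complete_ref[:8] + "N" * 8 + complete_ref[16:]
# A read from inside the gap that carries a single-base change (T to G).
read = "GGCGAACC"

print(call_variants(read, complete_ref))  # [(11, 'T', 'G')]
print(call_variants(read, gapped_ref))    # None
```

Against the complete toy reference the read maps and the single-base change is reported. Against the gapped version the same read has nowhere to land, so the variant is never seen.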
The biggest impact on health and disease will come from understanding new genes. We study genes and the effects that variants have on the proteins they produce. Over 3600 new genes were introduced in this study, any of which could be a missing link in our understanding of the biology of human disease and development. The new regions also include non-coding DNA, where variants in regulatory elements can lead to disease by changing how genes are regulated. We can expect studies of the function and impact of these new regions over the coming years.
While this is the highest-resolution map of the human genome to date, it doesn't resolve every challenge. First, T2T-CHM13 lacks chromosome Y. The authors used a cell line derived from a complete hydatidiform mole, in which the genome of a single sperm has been duplicated. The two copies of each chromosome are therefore essentially identical, which simplifies the job for the genome assembly tools that work out the reference sequence. It turns out these cells are only viable with two X chromosomes, so there is no Y to assemble. A second, technical limitation is that a few hundred genes present in GRCh38 are missing from T2T-CHM13, some because of lower copy number in CHM13 and some because of poor alignment to it. A more complete CHM13 will likely come as DNA sequencing technologies improve.
Providing a new reference genome doesn't guarantee its adoption. When I worked as a bioinformatician in a clinical laboratory, migrating to a new reference genome and validating it was never trivial. It is likely that many clinical laboratories have yet to migrate to GRCh38, which has been available since 2013 (see Lansdon et al. 2021 [3]).
Finally, a single reference genome cannot capture all of human variation. Because T2T-CHM13 comes from an individual of European ancestry, it will introduce biases in some applications, such as looking for disease variants in individuals of a different ancestry (see Sherman et al. 2019 [4]).
References:
1. Nurk, S., et al. (2022). The complete sequence of a human genome.
2. Aganezov, S., et al. (2022). A complete reference genome improves analysis of human genetic variation.
3. Lansdon, L. A., et al. (2021). Factors affecting migration to GRCh38 in laboratories performing clinical next-generation sequencing. The Journal of Molecular Diagnostics, 23(5), 651-657.
4. Sherman, R. M., et al. (2019). Assembly of a pan-genome from deep sequencing of 910 humans of African descent. Nature Genetics, 51(1), 30-35.