The International Maize and Wheat Improvement Center (CIMMYT) is the world’s primary source of breeding material for wheat and corn (maize). Founded in 1943 and vaulted to international recognition in the 1960s and 1970s, partly through the work of University of Minnesota alumnus and Nobel Peace Prize laureate Norman Borlaug, CIMMYT was the original model for the centers that now make up CGIAR. Over the years, CIMMYT has accumulated a large database of pedigree, trait, and passport information for millions of wheat genotypes, including finished varieties, germplasm accessions, and (advanced) breeding lines. This information is critical for breeding programs, which use it to track the inheritance of desired traits (e.g., yield, disease resistance, drought tolerance, baking quality) through the selections made in each generation. The coefficient of parentage (COP), also known as the coefficient of coancestry or kinship coefficient, is a particularly useful metric when considering any two genotypes as potential parents for breeding; it equals the inbreeding coefficient expected in their progeny. Formally, it is the probability that, for any gene, a copy drawn at random from one genotype and a copy drawn at random from the other are descended from the same copy in a common ancestor.
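To make the COP definition concrete, here is a minimal sketch of the standard recursive rule on a toy pedigree. The pedigree, the genotype names, and the helper functions are hypothetical and are not taken from PedTools. The sketch assumes that founders are unrelated and non-inbred; analyses of self-pollinated crops often assume fully inbred lines instead, which would change the value a genotype has with itself.

```python
from functools import lru_cache

# Hypothetical toy pedigree: child -> (parent_1, parent_2).
# Genotypes missing from the dict are treated as unrelated, non-inbred founders.
PEDIGREE = {
    "C": ("A", "B"),
    "D": ("A", "B"),
    "E": ("C", "D"),
}


def depth(genotype):
    """Number of generations between a genotype and its most distant founder."""
    if genotype not in PEDIGREE:
        return 0
    return 1 + max(depth(parent) for parent in PEDIGREE[genotype])


@lru_cache(maxsize=None)
def cop(x, y):
    """Coefficient of parentage between x and y (assumes an acyclic pedigree)."""
    if x == y:
        if x not in PEDIGREE:
            return 0.5  # assumption: founders are not inbred
        p, q = PEDIGREE[x]
        return 0.5 * (1.0 + cop(p, q))
    if x not in PEDIGREE and y not in PEDIGREE:
        return 0.0  # assumption: distinct founders are unrelated
    # Expand the genotype that sits deeper in the pedigree, so that we never
    # replace an ancestor of the other genotype by its own parents.
    if depth(x) < depth(y):
        x, y = y, x
    p, q = PEDIGREE[x]
    return 0.5 * (cop(p, y) + cop(q, y))


def cop_matrix(genotypes):
    """Pairwise COP values for a list of genotypes, as a list of rows."""
    return [[cop(a, b) for b in genotypes] for a in genotypes]


print(cop("C", "D"))           # full sibs from unrelated parents: 0.25
print(cop_matrix(["C", "D"]))  # [[0.5, 0.25], [0.25, 0.5]]
```

The recursion revisits the same pairs many times in deep pedigrees, which is why the sketch memoizes results; a production tool working at the scale of millions of genotypes would need a more careful strategy.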
GEMS Informatics worked with CIMMYT to analyze a subset of their large database, the 2012-2024 spring and durum wheat international nursery dataset. This dataset included 8.5 million genotypes, 11.8 million genotype aliases (e.g., internal cross names, commercial release names, abbreviations), and 18 million genotypic relationships. We set out to accomplish three tasks:
- Build a Python code repository with a scalable infrastructure. GEMS Informatics designed an approach using SQLAlchemy and SQLite that can accommodate hundreds of millions of genotypes and their relationships. This effort took six months and was successful: all of CIMMYT’s data can be ingested in just one hour on a contemporary laptop.
- Resolve naming discrepancies identified in the CIMMYT data. GEMS staff analyzed all common_name and cross_name designations among the 8.5 million genotypes and identified 544 pairs of genotypes in CIMMYT’s genebank that are putatively duplicates and hence may require consolidation in the database. It is a testament to the care taken by CIMMYT staff that only 544 such discrepancies were found among the names of these 8.5 million genotypes. Examples include common typographical errors (e.g., C0723595 and CO723595; II53.546 and 1153.546), punctuation variants (e.g., 4715D(5B) and 47-1-5D (5B); DARTS-IMPERIAL and DART´S IMPERIAL), compound word variants (e.g., PLAN ALTO and PLANALTO; YANG MAI 6 and YANGMAI 6), language variants (e.g., ALGERIAN and ALGERIEN; FEDERATION and FEDERACION), and misspellings (e.g., AEGILOP UMBELLULATA and ARGILOPS UMBELLULATA; ATALANTA and ATLANTA; AUBAKOMUGI and AOBAKOMUGHI).
- Provide harmonized pedigree datasets and query capabilities to CIMMYT via an API. All of the following questions can now be answered via API queries to the database:
  - What are the parents of any wheat genotype?
  - What are the pedigree entries at any arbitrary level (level 1 = parents; level 2 = grandparents; level 3 = great-grandparents; …)?
  - What is the full recursive pedigree for any genotype (ideally traced back to landraces, if possible)?
  - What are the known aliases for any wheat genotype?
  - What is the matrix of pairwise COP values for any pair or list of genotypes?
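The pedigree queries listed above map naturally onto a recursive query over a parent-child table. The sketch below illustrates the idea with Python's built-in sqlite3 module rather than SQLAlchemy, so that it stays self-contained. The table names, column names, sample genotypes, and helper functions are all hypothetical; the actual schema and API are not described here.

```python
import sqlite3

# Hypothetical schema: one row per genotype, per alias, and per parent-child link.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genotype (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE alias (genotype_id INTEGER REFERENCES genotype(id),
                    alias TEXT NOT NULL);
CREATE TABLE parentage (child_id INTEGER REFERENCES genotype(id),
                        parent_id INTEGER REFERENCES genotype(id));
CREATE INDEX idx_parentage_child ON parentage(child_id);
""")

# Made-up sample data: VARIETY_X = (LANDRACE_A x LANDRACE_B) x LANDRACE_C.
conn.executemany("INSERT INTO genotype VALUES (?, ?)", [
    (1, "LANDRACE_A"), (2, "LANDRACE_B"), (3, "LANDRACE_C"),
    (4, "LINE_AB"), (5, "VARIETY_X"),
])
conn.executemany("INSERT INTO parentage VALUES (?, ?)",
                 [(4, 1), (4, 2), (5, 4), (5, 3)])
conn.executemany("INSERT INTO alias VALUES (?, ?)",
                 [(5, "VAR X"), (5, "VX-2012")])

# Walk up the pedigree one generation per recursion step.
# Assumes the pedigree is acyclic; a cycle would make the recursion endless.
ANCESTORS_SQL = """
WITH RECURSIVE ancestors(id, level) AS (
    SELECT parent_id, 1 FROM parentage WHERE child_id = :gid
    UNION
    SELECT p.parent_id, a.level + 1
    FROM parentage AS p JOIN ancestors AS a ON p.child_id = a.id
)
SELECT g.name, a.level
FROM ancestors AS a JOIN genotype AS g ON g.id = a.id
ORDER BY a.level, g.name
"""


def ancestors(conn, genotype_id, level=None):
    """Return (name, level) rows; level=1 gives parents, 2 grandparents, etc."""
    rows = conn.execute(ANCESTORS_SQL, {"gid": genotype_id}).fetchall()
    if level is not None:
        rows = [row for row in rows if row[1] == level]
    return rows


def aliases(conn, genotype_id):
    """Return the known aliases of a genotype, sorted alphabetically."""
    cursor = conn.execute(
        "SELECT alias FROM alias WHERE genotype_id = ? ORDER BY alias",
        (genotype_id,),
    )
    return [row[0] for row in cursor]


print(ancestors(conn, 5, level=1))  # parents of VARIETY_X
print(ancestors(conn, 5))           # full recursive pedigree
print(aliases(conn, 5))
```

Keeping relationships in a single edge table with an index on the child column means that each step of the recursion is an index lookup, which is one plausible way for a design like this to remain usable as the number of relationships grows.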
Future Work
Cleaning and harmonizing wheat pedigrees worldwide. Previously, PedTools, developed by GEMS in 2017, could support only modest pedigree sizes involving thousands of genotypes, and was initially targeted at wheat pedigrees for US and Canadian varieties. This work enabled the GEMS team to enhance the PedTools infrastructure so that it can scale to collections of 10 million to 100 million genotypes. This makes it suitable for expansion to CIMMYT’s full wheat genebanks as well as publicly accessible repositories such as GrainGenes and GRIN. Further, PedTools can match genotypes across organizations, so it may serve to unify wheat pedigree collections across countries, CIMMYT, and public repositories by mapping the accessions at each center to one another.
Additional crops. The original incarnation of PedTools was used to harmonize ~10,000 soybean pedigrees for a UMN soybean breeding project. Our soybean breeding collaborator is currently digitizing decades of printed variety breeding information, so we plan to revisit that harmonization effort with the new version of PedTools. Soybean breeders do not use the Purdy notation (variety 1 / variety 2 // variety 3) for pedigrees, but instead use an arithmetic notation (e.g., ((variety 1 x variety 2) x variety 3)). With this in mind, the underlying architecture of PedTools has been designed to accept a plugin implementing any custom set of rules for parsing the variety names and pedigrees of a specific crop community. In the future, we can therefore write a new parser to ingest a new pedigree format while the rest of the PedTools machinery remains unchanged, since the internal representation of varieties and their relationships is the same.
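The plugin idea can be illustrated with two small parsers that turn different notations into the same nested structure. This is only a sketch under stated assumptions: the function names, the registry, and the tuple-based representation are hypothetical and do not reflect the PedTools internals. The Purdy-style parser handles only the cross separators (/, //, /3/, and so on) and ignores backcross markers, and the arithmetic parser expects explicit parentheses.

```python
import re
from typing import Callable, Dict, Tuple, Union

# Shared internal representation (hypothetical): a variety name,
# or a (parent_1, parent_2) pair whose members are themselves pedigrees.
Pedigree = Union[str, Tuple["Pedigree", "Pedigree"]]

# Purdy-style cross separators: "/" (order 1), "//" (order 2), "/n/" (order n).
_PURDY_SEP = re.compile(r"/(\d+)/|//|/")


def parse_purdy(text: str) -> Pedigree:
    """Split at the highest-order separator, then recurse on both sides."""
    seps = [
        (int(m.group(1)) if m.group(1) else len(m.group(0)), m.start(), m.end())
        for m in _PURDY_SEP.finditer(text)
    ]
    if not seps:
        return text.strip()
    _, start, end = max(seps, key=lambda sep: sep[0])
    return (parse_purdy(text[:start]), parse_purdy(text[end:]))


def _is_wrapped(text: str) -> bool:
    """True if the first '(' is closed by the final character of text."""
    depth = 0
    for i, ch in enumerate(text):
        depth += (ch == "(") - (ch == ")")
        if depth == 0:
            return i == len(text) - 1
    return False


def parse_arithmetic(text: str) -> Pedigree:
    """Parse '((a x b) x c)' by splitting at the first top-level ' x '."""
    text = text.strip()
    if text.startswith("(") and _is_wrapped(text):
        return parse_arithmetic(text[1:-1])
    depth = 0
    for m in re.finditer(r"[()]|\s[xX]\s", text):
        token = m.group(0)
        if token == "(":
            depth += 1
        elif token == ")":
            depth -= 1
        elif depth == 0:
            left, right = text[:m.start()], text[m.end():]
            return (parse_arithmetic(left), parse_arithmetic(right))
    return text


# Registry of notation plugins; supporting a new crop means adding one entry.
PARSERS: Dict[str, Callable[[str], Pedigree]] = {
    "purdy": parse_purdy,
    "arithmetic": parse_arithmetic,
}


def parse_pedigree(text: str, notation: str) -> Pedigree:
    return PARSERS[notation](text)


wheat = parse_pedigree("variety 1 / variety 2 // variety 3", "purdy")
soybean = parse_pedigree("((variety 1 x variety 2) x variety 3)", "arithmetic")
print(wheat == soybean)  # both notations yield the same nested structure
```

Because both parsers emit the same structure, everything downstream (storage, pedigree queries, COP calculations) can stay unaware of which notation a given crop community uses.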
Given this extensibility, we plan to engage pedigree data curators for other crops at various CGIAR Centers and elsewhere to help standardize their pedigrees and thus streamline and accelerate trait discovery and varietal development efforts.
Photo credit: A. Morgounov/CIMMYT.
Funding. The activities described here were conducted with support from the Government of Mexico and Minnesota State Government MnDRIVE funding made available to GEMS Informatics.