Order No.:

Université Paris XI
UFR Scientifique d'Orsay

THESIS submitted for the degree of Doctor of Science of Université Paris XI Orsay

Speciality: Life Sciences

MARKER-ASSISTED GENOTYPE CONSTRUCTION METHODS
(original title: Méthodes de construction de génotypes assistée par marqueurs)

by Bertrand SERVIN

Defended on 4 December 2003 before the examination committee:

Mr P. BLANCHARD, Euralis Génétique, Examiner
Ms M. CAUSSE, Research Director, INRA, Reviewer
Mr A. CHARCOSSET, Research Director, INRA, Examiner
Mr A. GALLAIS, Professor, INAP-G, Examiner
Mr F. HOSPITAL, Research Director, INRA, Thesis supervisor
Mr O. C. MARTIN, Professor, Université Paris XI Orsay, President
Mr P.M. VISSCHER, Professor, University of Edinburgh, UK, Reviewer

A village. Sound of chanting of Latin canon, punctuated by short, sharp cracks. A group of villagers who are dragging a beautiful YOUNG WOMAN dressed as a witch through the streets. They drag her to a strange house/ruin standing on a hill outside the village. A strange-looking knight stands outside, SIR BEDEVERE.] FIRST VILLAGER : We have found a witch. May we burn her ? ALL VILLAGERS : Burn her ! Burn her ! Burn her ! Burn her ! Burn her ! BEDEVERE : How do you know she is a witch ? FIRST VILLAGER : She looks like one ! ALL VILLAGERS : Yeah ! Yeah ! Burn her ! Yeah ! BEDEVERE : Bring her forward. [They bring her forward - a beautiful YOUNG GIRL (MISS ISLINGTON) dressed up as a witch.] WITCH : I’m not a witch. I’m not a witch. [...] BEDEVERE : What makes you think she is a witch ? SECOND VILLAGER : Well, she turned me into a newt ! BEDEVERE : [after a pause] A newt ? [Others stare and look at SECOND VILLAGER, who is plainly a human, not a newt.] SECOND VILLAGER : [Notices the stares. After a pause :] I got better. ALL VILLAGERS : Burn her anyway ! Burn her ! Burn her ! Burn her ! BEDEVERE : Quiet ! Quiet ! There are ways of telling whether she is a witch. [ARTHUR and PATSY ride up at this point and watch what follows with interest] ALL VILLAGERS : Are there ? There are ? What are they ? Tell us ! Do they hurt ? BEDEVERE : Tell me ... What do you do with witches ? ALL VILLAGERS : Burn them ! Burn them ! Burn them up ! BEDEVERE : And what do you burn apart from witches ? FIRST VILLAGER : More witches ! SECOND VILLAGER : Sh ! THIRD VILLAGER : Wood ! BEDEVERE : So why do witches burn ? FOURTH VILLAGER : [pianissimo] ... Because they’re made of wood... ? BEDEVERE : Good. [PEASANTS stir uneasily then come round to this conclusion.] ALL VILLAGERS : Oh ! Oh yeah ! BEDEVERE : So. How do we tell whether she is made of wood ? FIRST VILLAGER : Build a bridge out of her ! BEDEVERE : Ah ... but can you not also make bridges out of stone ? ALL VILLAGERS : Oh yeah. Oh yeah. Uhh... 
BEDEVERE : Uh, does wood sink in water ? ALL VILLAGERS : No ! No ! No ! It floats ! It floats ! Throw her into the pond ! The pond ! BEDEVERE : What also floats in water ? ALL VILLAGERS : ... Bread ! ...Apples ! ... Uh, very small rocks ! Cider ! Gra- Gravy ! Cherries ! Mud ! Churches ! Churches ! Lead ! Lead ! ARTHUR : A duck ! [They all turn and look at ARTHUR. BEDEVERE looks up very impressed.] BEDEVERE : Exactly. So... logically ... FIRST VILLAGER : [beginning to pick up the thread] If... she ... weighs.. the same.. as a duck ... she’s made of wood. BEDEVERE : And therefore ? ALL VILLAGERS : A witch ! A witch ! A witch !

Monty Python, The quest for the holy grail, or "On the Art of Scientific Reasoning"

“If Mendel’s law is true, it is worth millions of dollars to the breeders of plants in this country.” W.J. Spillmann (USDA) – 1904.

Table of Contents

I General Introduction
II Computing multilocus genotype frequencies in complex pedigrees
III Principles for optimizing marker-assisted backcrossing
IV Optimizing gene pyramiding programs
V Conclusion and Perspectives

Part One: General Introduction


The aim of crop improvement is to produce varieties with new characteristics for traits of agronomic interest (variety development). Most of these agronomic traits are quantitative traits: their variability results from genetic diversity at several genes and is influenced by the environment. The goal of selection is to combine, within a single variety, genes that positively affect traits of agronomic interest. Classically, individuals are selected on the basis of their phenotype, that is, the outcome of the expression of their genes in a particular environment. Phenotypic selection is therefore indirect selection on the genotype, and the breeder's main objective is to predict the genetic value of individuals. This genetic value determines the outcome of selecting an individual, because environmental effects are not transmitted to the progeny. Under a simple model, the relationship between the phenotype ($P_{ij}$) of an individual $i$ measured in environmental condition $j$ and its genetic value ($G_i$) is:

$P_{ij} = G_i + e_{ij}$    (1)

where $e_{ij}$ is an environmental effect confounded between individual $i$ and environmental condition $j$. Under this simple model, one way to estimate the genetic value of an individual is to observe its phenotype in several environmental conditions $j$, so that $\hat{G}_i = E_j(P_{ij})$ (Gallais 1990). The estimate of the genetic value becomes more precise as the number of environmental conditions in which phenotypes are measured increases. Note that in this case the genotype must be exactly "repeatable"; this is the case for clones, pure lines and single-cross hybrids.
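As an illustration of model (1), here is a minimal Python sketch. It assumes that the environmental deviations $e_{ij}$ are independent normal draws, which the model itself does not specify, and the function names `simulate_phenotypes` and `estimate_genetic_values` are hypothetical, chosen only for this example.

```python
import random
from statistics import mean


def simulate_phenotypes(genetic_values, n_env, env_sd, seed=0):
    """Simulate P_ij = G_i + e_ij for each individual i in n_env environments.

    Assumption (not from the text): the e_ij are independent draws from a
    normal distribution with mean 0 and standard deviation env_sd.
    """
    rng = random.Random(seed)
    return [[g + rng.gauss(0.0, env_sd) for _ in range(n_env)]
            for g in genetic_values]


def estimate_genetic_values(phenotypes):
    """Estimate each G_i as the mean of P_ij over the environments j."""
    return [mean(row) for row in phenotypes]


if __name__ == "__main__":
    true_g = [10.0, 12.0, 15.0]
    # Print the estimates obtained with an increasing number of environments.
    for n_env in (2, 20, 200):
        phenotypes = simulate_phenotypes(true_g, n_env, env_sd=3.0)
        print(n_env, [round(g, 2) for g in estimate_genetic_values(phenotypes)])
```

Under the independence assumption stated above, the sampling variance of each estimate is the environmental variance divided by the number of environments, which is why the estimates tighten as more environments are used.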
Breeders have long sought a better knowledge of the genes that determine the selected traits, in particular so as to predict the genetic value of individuals from their genotype at these genes. This requires knowing both the effects of these genes and their location on the chromosomes. With this information, it would be possible to determine the best combination of genes to assemble, and crosses between individuals could then be directed to obtain this combination as efficiently as possible. The aim is to escape, at least in part, from environmental effects by selecting individuals directly on their genotype. Reaching this goal requires meeting several conditions: determining the number and effects of the genes of interest, locating them on the chromosomes so that they can be manipulated, and finally knowing how to combine them efficiently in a single genotype.

Molecular markers are a very important tool for this purpose. Molecular markers are loci whose genotype in a given individual can be easily determined. Individuals in a population can then be selected on their marker genotype (Marker-Assisted Selection, MAS). However, these markers are rarely the genes of interest themselves. To use marker genotype information in breeding programs, the associations between marker alleles and alleles at loci of agronomic interest must first be determined. For genes involved in quantitative traits (QTL, for Quantitative Trait Loci), the search for these associations is called QTL detection. The English term QTL mapping makes clear that the aim is also to map these loci on the genome. The efficiency of selection on marker genotype depends on the robustness of these associations, so we briefly review the methods and expected results of QTL detection.

1 Identification of gene/marker associations

At present, the main use of molecular markers in plant breeding is the identification of chromosomal regions involved in the variability of quantitative traits. These experiments are called QTL detection experiments. Many QTL detection methods exist; they differ in the population structures used and in the statistical methods used to identify the markers linked to genes of agronomic interest. We do not review all of these methods here; instead we present their recent evolution and their link with marker-assisted selection.

Segregating bi-parental population

The simplest way to detect QTL is to create a dedicated segregating population from two completely homozygous parents. In this case, the parental alleles are in complete linkage disequilibrium in the F1 hybrid obtained by crossing the two parents: each F1 hybrid is formed by the union of two gametes, each of which carries all the alleles of one parent. Typically, this hybrid is then either selfed to produce an F2 population or crossed back to one of the parents to produce a backcross 1 (BC1) population. In these segregating populations, linkage disequilibrium between genes is reduced by the recombinations that occurred in the gametes produced by the F1 hybrid. Alleles therefore segregate according to the recombination rates between loci. The population (F2 or BC1) then consists of individuals carrying alleles from both parents, and shows variability for agronomic traits.
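The effect of recombination in the F1 gametes can be sketched for two loci as follows. This is a minimal illustration, assuming an F1 of phase AB/ab and a single recombination rate `r` between the two loci; the function names are hypothetical.

```python
def f1_gamete_frequencies(r):
    """Gamete frequencies produced by an F1 hybrid AB/ab at two loci.

    r is the recombination rate between the loci (0 <= r <= 0.5).
    One parent contributed haplotype AB, the other haplotype ab.
    """
    if not 0.0 <= r <= 0.5:
        raise ValueError("recombination rate must lie in [0, 0.5]")
    return {
        "AB": (1.0 - r) / 2.0,  # parental haplotype
        "ab": (1.0 - r) / 2.0,  # parental haplotype
        "Ab": r / 2.0,          # recombinant haplotype
        "aB": r / 2.0,          # recombinant haplotype
    }


def linkage_disequilibrium(freqs):
    """D = f(AB) - f(A) * f(B), computed from gamete frequencies."""
    f_a = freqs["AB"] + freqs["Ab"]
    f_b = freqs["AB"] + freqs["aB"]
    return freqs["AB"] - f_a * f_b
```

With `r = 0` the disequilibrium among gametes keeps its maximal value of 0.25, and with `r = 0.5` it drops to 0. In a BC1 to the parent ab/ab, the genotype frequencies of the progeny are these gamete frequencies.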
To detect QTL in these populations, the simplest method is to perform an analysis of variance (ANOVA) at each marker to determine the share of trait variance explained by the genotype of individuals at that marker. If genotype differences between individuals at a marker explain a significant share of the phenotypic differences between them, the marker is associated with one or more QTL (e.g. Sax (1923)). Single-marker ANOVA is simple, but (i) it cannot tell whether the tested marker is linked to one or to several QTL and (ii) it does not allow the position of the QTL to be estimated precisely. Lander and Botstein (1989) proposed a statistical method that tests for the presence of a QTL at any position in the genome using the information at the markers flanking that position. This method, called Interval Mapping, offers several advantages over single-marker analysis. In particular, the position of the QTL can be estimated and QTL effects are better estimated. However, the method also has drawbacks. In particular, the value of the test statistic at a position that does not correspond to a QTL can be affected by QTL located elsewhere on the chromosome and can exceed the significance threshold. If there is only one QTL on the chromosome, this is not a serious problem, because the most likely position of the QTL is still well estimated. If several QTL are on the chromosome, however, the statistic is affected by all of them, and the estimated positions and effects of the detected QTL are likely to be biased. Interval Mapping can be used to test for several QTL on the same chromosome; however, because the method of Lander and Botstein (1989) relies on finding a maximum likelihood, the results depend heavily on the genetic hypotheses studied and require heavy computation. To reduce computing time, Haley and Knott (1992) used a less computationally demanding test statistic based on multiple regression rather than maximum likelihood. More complex statistical methods have been developed to account for the possible presence of several QTL on the same linkage group (Composite Interval Mapping, Zeng (1994) and Jansen (1993)), and also to detect several QTL simultaneously and thereby detect interaction effects between QTL (epistasis) (e.g. Multiple Interval Mapping, Kao et al. (1999)).
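The single-marker analysis described at the start of this subsection can be sketched as a one-way ANOVA of phenotype on genotype class. The function below is a hypothetical, minimal illustration that returns only the F statistic; it is not the procedure of any particular QTL package, and the significance threshold is left to the user.

```python
from statistics import mean


def single_marker_f(phenotypes, genotypes):
    """One-way ANOVA F statistic of phenotype on genotype class at one marker.

    phenotypes: list of trait values, one per individual.
    genotypes:  list of genotype labels at the marker, in the same order.
    """
    groups = {}
    for y, g in zip(phenotypes, genotypes):
        groups.setdefault(g, []).append(y)
    n, k = len(phenotypes), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two genotype classes and n > k")
    grand_mean = mean(phenotypes)
    # Sum of squares between genotype classes.
    ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2
                     for v in groups.values())
    # Sum of squares within genotype classes (residual).
    ss_within = sum((y - mean(v)) ** 2 for v in groups.values() for y in v)
    if ss_within == 0:
        raise ValueError("no within-class variation")
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

For example, in a BC1 with two genotype classes, six individuals with phenotypes 10, 12, 11 (class AA) and 20, 22, 21 (class AB) give a large F value, whereas a marker whose classes are shuffled with respect to the phenotype gives a small one.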
However, studying a population in which only two parental alleles segregate has drawbacks when the marker-QTL associations are later used to produce new improved varieties. With a broader genetic base, one could expect to find more QTL for a given trait and thus obtain a larger genetic gain by increasing the number of polymorphic genes on which selection can act. For this reason, studies have been conducted to develop QTL detection methods for populations derived from a multi-parental genetic base.

Segregating multi-parental population

To analyze populations with a broader genetic base, the simplest method is to study a population made up of several populations with narrow (typically bi-parental) genetic bases. QTL detection can then be carried out with methods similar to those described in the previous paragraph. However, those models assume fixed effects of the parental alleles and therefore require distinguishing as many genotype classes as there are possible marker genotypes. In Interval Mapping, all possible two-marker genotypes must be considered. For example, with only 3 possible alleles at each marker, there are already 45 possible genotypes at the two markers flanking the tested position. The number of parameters to estimate in the model quickly becomes very large, and the sample sizes required to estimate them become too large. Moreover, if the number of parents considered is large, the assumption that they all carry a different allele at a QTL becomes very strong. It is more realistic to assume that some of these parents carry alleles inherited from a common ancestor. This has the further advantage that the number of allelic effects to estimate is smaller than the number of parents of the multi-parental population, which reduces the number of parameters of the statistical models. However, the parents that carry the same alleles at the QTL must then be identified. To do so, Jansen et al. (2003) proposed comparing the marker haplotypes of the parents around each tested position: a marker haplotype is the marker genotype of one of the gametes that make up an individual. A diploid individual thus carries two haplotypes. In doubled haploids, these two haplotypes are strictly identical. In the method proposed by Jansen et al. (2003), if two parents carry the same haplotypes at markers around the tested position, these haplotypes are considered to carry the same allele at the locus corresponding to the tested position. For this approximation not to be too strong, recombination rates between markers must be low, so this assumption requires a high marker density. For the statistical analysis, Jansen et al. (2003) propose a method derived from Interval Mapping that estimates the effects of the parental haplotypes. This approach obviously requires knowing the haplotypes of the parents and of their progeny. If the individuals are not homozygous, their marker haplotypes must be reconstructed, that is, the genotype of each of the gametes that make up each individual must be identified.
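The figure of 45 two-marker genotypes quoted earlier for 3 alleles per marker can be checked by enumeration. The sketch below assumes that genotypes are counted as unordered pairs of haplotypes (phase known), which reproduces the count in the text; the function names are hypothetical.

```python
from itertools import combinations_with_replacement, product


def enumerate_genotypes(n_alleles, n_markers=2):
    """List phase-known diploid genotypes as unordered pairs of haplotypes."""
    haplotypes = list(product(range(n_alleles), repeat=n_markers))
    return list(combinations_with_replacement(haplotypes, 2))


def count_genotypes(n_alleles, n_markers=2):
    """Closed form: h * (h + 1) / 2 with h = n_alleles ** n_markers haplotypes."""
    h = n_alleles ** n_markers
    return h * (h + 1) // 2
```

With 3 alleles at each of 2 markers there are 9 haplotypes and 9 × 10 / 2 = 45 genotypes; with 4 alleles the count is already 136, which illustrates how quickly the number of fixed-effect classes grows.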
This reconstruction can be error-prone, which may be a drawback of the method. The effect of haplotype reconstruction errors on QTL detection with this method remains to be studied.

Another way to detect QTL in multi-parental populations is to use methods based on computing probabilities of identity by descent (IBD) between individuals. The principle of these methods is that two individuals that are identical by descent at QTL are more likely to have similar phenotypes. These methods rely on mixed models in which QTL effects are assumed to follow a normal distribution. These models do not require assuming that the number of segregating alleles equals the number of parents, and therefore need fewer parameters than fixed-effect models. These methods are known as the "two-step variance component approach" (Hoeschele et al. (1997); Xie et al. (1998); George et al. (2000)). An example of QTL detection in a simulated multi-parental population using a variant of these methods can be found in Crepieux et al. (submitted). These methods allow QTL to be detected in populations derived from complex pedigrees, the limiting factor being the ability to compute IBD probabilities between individuals. In the method of Crepieux et al., these probabilities were computed with the MDM program (Servin et al. 2002), which computes them between individuals from segregating bi-parental populations produced by arbitrarily complex successions of crosses (e.g. recombinant and highly recombinant inbred lines, advanced backcrosses, doubled haploids ...). Note that, by nature, these methods do not give direct access to the effects of the parental alleles. Once the position of the QTL has been estimated, the effects of the parental alleles must therefore be re-estimated.

In the methods described above, the populations consist of individuals descended from crosses between fully homozygous lines. Initially there is therefore complete association between marker alleles and QTL alleles in each parent: linkage disequilibrium between loci is maximal in the F1. Over the following generations, recombination reduces this linkage disequilibrium. The reduction depends on the recombination rate between loci. QTL detection in these populations is called linkage analysis (LA), because the genetic distances (or linkage) between loci are sufficient to compute the probabilities of the different genotypes at the tested position.

Mapping using Linkage Disequilibrium

One way to analyze populations with high allelic diversity is to pool individuals of arbitrary origin. To detect QTL in such populations, the methods described above must be adapted. If the relationships between individuals are not fully known, or are too complex to allow allelic transmission to be computed, linkage between loci can no longer be used directly to infer the possible genotypes at a tested position. One solution is then to use other methods, called "detection using linkage disequilibrium" (LD mapping; for a review see Terwilliger and Weiss (1998) and Weiss and Terwilliger (2000)).
In this case, linkage disequilibrium between loci is not known a priori but must be estimated from the marker genotypes of the individuals. The principle is to assume that individuals carrying the same marker haplotypes are identical by descent between those markers. However, identity-by-descent probabilities can no longer be estimated by computing allelic transmission. Meuwissen and Goddard (2001) developed a method for estimating identity-by-descent probabilities between individuals from marker haplotypes and from knowledge of the evolutionary history of the species to which these individuals belong (mainly the effective population size estimated from marker haplotypes, Hayes et al. (2003)). Once these probabilities have been computed, mixed model analysis can be used to test for the presence of a QTL at a given genome position. The main problem with QTL detection by linkage disequilibrium is that it requires a very high marker density. In populations that are not obtained by crossing homozygous lines, linkage disequilibrium between loci decreases very rapidly with genetic distance. To overcome this problem, Meuwissen et al. (2001) proposed using both the information on allelic transmission provided by the pedigrees of the individuals and the linkage disequilibrium observed in marker haplotypes. This method is called LDLA mapping (for Linkage Analysis and Linkage Disequilibrium Mapping). Using this approach, Meuwissen et al. (2002) were able to fine-map a QTL to an interval between two markers 1 centiMorgan apart, whereas with either method used alone the confidence interval on the QTL position was about 10 centiMorgans.

Partial conclusion

New statistical methods and the availability of very dense genetic maps (in particular thanks to SNP markers) provide high discriminating power for mapping QTL and therefore for locating them finely on the genome. With this information, it is in some cases possible to exploit genomic results on the DNA sequences surrounding the QTL position and to identify the corresponding gene. The positions of QTL, and possibly the genes involved in the determinism of traits of agronomic interest, are information that can be exploited in breeding. This is the purpose of marker-assisted selection. Its general principle is to accumulate genes of agronomic interest in a single genotype by manipulating them either directly or through the marker loci linked to them. We now detail the main theoretical approaches that have been explored to optimize this accumulation.

2 Use of gene/marker associations in selection: Marker-Assisted Selection (MAS)

The use of QTL-marker associations in selection can be considered in two different settings. Markers can be used to increase the precision of the estimated genetic values of the individuals that are candidates for selection. Markers also make it possible to follow recombination events during directed crosses. The information they provide can thus be used to direct the accumulation of previously mapped genes. This approach is called genotype construction.

2.1 Using markers to predict the genetic value of individuals

As seen in Section 1, associations between markers and QTL can be detected. This information can be used to predict the genetic value of individuals. However, part of the variation of the trait is not explained by differences in marker genotypes. The loci involved in this unexplained variation are grouped under the term polygene. Markers therefore do not necessarily provide complete information on the genetic value of individuals. For this reason, Lande and Thompson (1990) proposed integrating the information given by markers into the estimation of the genetic value of individuals by building a selection index that jointly takes into account the phenotypic and genotypic information available for each individual. This index is:

$\hat{z} = b_0 y + b_1 s$    (2)

where $b_0$ and $b_1$ are the weights given to the phenotype and to the markers, respectively; $y$ is the phenotype of the individual; and $s$ is a value called the molecular score, which accounts for the marker genotype of the individual and for the additive effect on the trait associated with these markers. The molecular score is computed as:

$s = \sum_{i \in M} \hat{\beta}_i x_i$    (3)

where $M$ is the set of markers associated with a significant additive effect detected by multiple regression; $\hat{\beta}_i$ is the additive effect associated with marker $i$; and $x_i$ is the genotype of the individual at marker $i$. Lande and Thompson (1990) used selection index theory to show that the weights $b_0$ and $b_1$ that maximize the efficiency of selection can be determined by:

$b_1 / b_0 = (1/h^2 - 1) / (1 - p)$    (4)

where $h^2$ is the heritability of the trait, that is, the share of the phenotypic variance due to the additive genetic variance, itself composed of the variance associated with the markers of significant effect and of the polygenic variance; and $p$ is the share of the additive genetic variance associated with the markers. Lande and Thompson (1990) predicted analytically that using this index increases the efficiency of selection, in particular when the heritability of the trait is low and the markers are associated with a large share of the additive genetic variance. However, for traits of low heritability, it is difficult to find significant associations between variation in marker genotypes and variation in phenotypes. Moreau et al. (1998) thus showed that there is in fact an optimal intermediate heritability for the efficiency of MAS.

Construction of the marker score

The method proposed by Lande and Thompson (1990) to build the selection index $\hat{z}$ consists in selecting the markers to be used in computing the marker score $s$. In their study, they proposed determining marker effects by multiple regression of the phenotypes of the individuals on their marker genotypes. This associates an additive effect on the trait with each marker.
Only the markers associated with statistically significant effects are kept for computing the marker score $s$. Various authors have sought to improve the marker selection method (e.g. Whittaker et al. (1995); Whittaker et al. (1997)). However, Lange and Whittaker (2001) showed that selecting markers is always sub-optimal, in particular because it leads to over-estimation of the effects associated with them. These authors proposed building the selection index in a single step, including all markers in the molecular score. They showed that this method gives a substantial gain in selection efficiency. It is worth noting that the best information to use for this purpose is therefore given by marker-phenotype associations and not directly by QTL-phenotype associations. Precise detection of the location of QTL is thus not a fundamental prerequisite for this type of marker-assisted selection.

Evaluation of marker effects

Over generations, recombination modifies the associations between markers and QTL and therefore the effects associated with the markers (Gimelfarb and Lande 1994). Ideally, the effects associated with the markers should therefore be re-evaluated at each generation. This obviously reduces the value of marker-assisted selection. Hospital et al. (1997) suggested re-evaluating marker effects every 2 or 3 generations. This makes it possible to alternate cycles of selection on markers alone with generations in which individuals are re-evaluated phenotypically and marker effects are re-estimated. This alternation is particularly useful when the phenotype of individuals must be evaluated on their progeny, because the phenotypic evaluation steps then typically last two generations. Using cycles of selection on markers alone increases the genetic gain per unit of time.
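The index of equations (2) to (4) can be sketched as follows. This is a minimal illustration: the function names are hypothetical, the numerical coding of the marker genotypes $x_i$ is an assumption left to the caller, and $b_0$ is fixed to 1 by default because equation (4) only determines the ratio $b_1 / b_0$.

```python
def molecular_score(effects, genotypes):
    """Equation (3): s = sum of beta_i * x_i over the retained markers.

    effects:   estimated additive effects beta_i, one per marker.
    genotypes: numerical marker genotypes x_i of one individual
               (the coding scheme is an assumption of this sketch).
    """
    if len(effects) != len(genotypes):
        raise ValueError("effects and genotypes must have the same length")
    return sum(beta * x for beta, x in zip(effects, genotypes))


def index_weight_ratio(h2, p):
    """Equation (4): b1 / b0 = (1 / h2 - 1) / (1 - p)."""
    if not 0.0 < h2 <= 1.0:
        raise ValueError("heritability h2 must lie in (0, 1]")
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    return (1.0 / h2 - 1.0) / (1.0 - p)


def selection_index(y, s, h2, p, b0=1.0):
    """Equation (2): z = b0 * y + b1 * s, with b1 derived from equation (4)."""
    b1 = b0 * index_weight_ratio(h2, p)
    return b0 * y + b1 * s
```

For instance, with `h2 = 0.5` and `p = 0.5` the markers receive twice the weight of the phenotype; lowering the heritability to 0.2 with the same `p` raises the ratio to 8, in line with the prediction that markers matter most for traits of low heritability.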

2.2 Genotype construction: using markers to accumulate genes (QTL) in a single genotype

Genotype construction is a special case of marker-assisted selection in which individuals are selected only on the basis of their genotype: it is selection on markers alone. There is no phenotypic evaluation during the generations of selection; the aim is to obtain an ideal genotype (ideotype) as quickly as possible. The approach implicit in genotype construction breaks down into two steps:
1. Define the marker ideotype, that is, the genotype of the individual that must come out of the selection process.
2. Determine the means and methods needed to obtain this ideotype as quickly as possible.

The most common genotype construction method is marker-assisted backcrossing. Backcrossing is a breeding method that aims to introgress a gene from a parent called the donor into the genetic background of a recipient parent. It is a genotype construction method that does not necessarily require markers. However, selection on markers accelerates the process while making it possible to check the quality of the individuals produced. The principles for optimizing marker-assisted backcrossing are set out in a dedicated part of this thesis and are therefore not detailed here. Marker-assisted backcrossing nevertheless remains a program used to manipulate few genes, and its main application is the targeted improvement of an existing variety.
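As a point of reference for what markers can accelerate, the sketch below gives the expected share of recipient genome without any marker selection. It relies on the standard expectation that each backcross halves the remaining donor share, and it assumes loci that are not linked to the introgressed gene; the function name is hypothetical.

```python
def expected_recipient_fraction(n_backcrosses):
    """Expected recipient genome share after n backcrosses without selection.

    Assumes loci unlinked to the introgressed gene: the F1 is 50% recipient
    and each backcross halves the remaining donor share.
    """
    if n_backcrosses < 0:
        raise ValueError("number of backcrosses must be non-negative")
    return 1.0 - 0.5 ** (n_backcrosses + 1)
```

Under these assumptions the expected recipient share is 75% in BC1 and 93.75% in BC3. Regions linked to the introgressed gene return to the recipient type more slowly, which this simple expectation does not capture.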

When many QTL are to be accumulated in a single genotype, backcrossing is not a suitable method: other genotype construction methods must be developed. To accumulate QTL detected in bi-parental populations, one possible solution is to identify pairs of individuals that are complementary for the favorable alleles at the QTL. By crossing these individuals, it would then be possible to find among their progeny individuals that accumulate all the favorable QTL alleles present in the parents. Figure 1 presents the principle of this genotype construction method. In the first step, the best pairs of parents (typically recombinant inbred lines (RIL) or doubled haploids (DH)) are selected and crossed to produce a new population of lines. If the ideotype is found among these lines, genotype construction is complete; otherwise, the process is repeated by selecting the best pairs of daughter lines. The selection / re-crossing cycles are repeated until the ideotype is obtained.

[Figure 1, flowchart: QTL detection population (RIL, DH) → Selection of the best crosses → Production of populations of recombinants (RIL, DH) → Contains the ideotype? No: return to selection of the best crosses. Yes: End.]

Fig. 1 – Construction de g´enotypes par s´election r´ecurrente dans des populations de lign´ees recombinantes ou d’haplo¨ıdes doubl´es. Chaque cycle est initi´e en s´electionnant les paires de lign´ees de g´enotypes compl´ementaires aux QTL. van Berloo et Stam (1998) ont ´et´e les premiers `a sugg´erer ce type de sch´ema pour le cumul de QTL. Ils ont propos´e une m´ethode pour s´electionner les meilleures paires de RIL parmi toutes les paires possibles qui consiste `a calculer un index de s´election bas´e sur le g´enotype de l’hybride F1 obtenu en croisant deux lign´ees. Cet index est  X X CI = βi(c) × GF 1 (i(c)) (5) c

i(c)

o` u c est le chromosome consid´er´e XIX

i(c) est un intervalle du chromosome c contenant un QTL βi(c) est l’effet du QTL correspondant `a l’intervalle i(c) GF 1 (i(c)) est le nombre d’all`eles favorables de l’hybride F1 pour l’intervalle i(c) d´etermin´e en fonction du g´enotype des marqueurs de l’intervalle. Cet index est calcul´e pour chaque paire possible de RIL. Les paires de RIL pr´esentant l’index le plus ´elev´e sont alors crois´ees pour obtenir un g´enotype cumulant les QTL des deux RIL. La s´election sur l’index CI conduit `a s´electionner un ensemble de lign´ees qui est diff´erent de celui obtenu en conservant les lign´ees ayant les valeurs ph´enotypiques les plus ´elev´ees comme le montre la figure 2. L’int´erˆet des marqueurs est donc ici d’identifier les lign´ees compl´ementaires pour les g`enes impliqu´es dans les caract`eres s´electionn´es, ce qu’il n’est pas possible de faire par s´election ph´enotypique.
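As an illustration of index (5), the following minimal Python sketch scores every pair of lines. It rests on assumptions that are not in the original: each RIL is coded 1 or 0 per QTL interval (fixed or not for the favorable allele), G_F1 is taken simply as the sum of the two parental codes, and all line names and effect values are hypothetical. The marker-based determination of G_F1 used by van Berloo and Stam (1998) is not reproduced here.

```python
from itertools import combinations

# Hypothetical QTL effects (beta), keyed by (chromosome, interval).
effects = {(1, "i1"): 2.0, (1, "i2"): 1.0, (2, "i1"): 1.5}

# Hypothetical RILs: 1 if the line is fixed for the favorable allele, else 0.
rils = {
    "RIL_A": {(1, "i1"): 1, (1, "i2"): 0, (2, "i1"): 0},
    "RIL_B": {(1, "i1"): 0, (1, "i2"): 1, (2, "i1"): 1},
    "RIL_C": {(1, "i1"): 1, (1, "i2"): 1, (2, "i1"): 0},
}


def complementation_index(line_a, line_b, effects):
    """Sum over chromosomes and intervals of beta * G_F1 (equation 5)."""
    total = 0.0
    for interval, beta in effects.items():
        # Assumed coding: the F1 receives one allele from each homozygous parent.
        g_f1 = line_a[interval] + line_b[interval]
        total += beta * g_f1
    return total


def rank_pairs(rils, effects):
    """Return (index, line, line) tuples sorted from best to worst pair."""
    scored = [
        (complementation_index(rils[a], rils[b], effects), a, b)
        for a, b in combinations(sorted(rils), 2)
    ]
    return sorted(scored, reverse=True)


for index, a, b in rank_pairs(rils, effects):
    print(f"{a} x {b}: CI = {index}")
```

With these made-up values the pair RIL_B x RIL_C ranks first (CI = 5.5).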

Fig. 2 – Crosses selected within a RIL population by (A) marker-assisted selection and (B) phenotypic selection. Marker-assisted selection identifies the pairs of RIL that are complementary for QTL pyramiding. After van Berloo and Stam (1998).

van Berloo and Stam (1998) considered only a single cycle of selection of RIL pairs, which generally does not yield the ideotype. Indeed, to obtain it from a cross between only two RIL, the F1 hybrid must carry at least one favorable allele at each QTL, which is unlikely when the number of QTL is high. Moreover, even if such an F1 hybrid can be obtained, the probability of then obtaining the ideotype without selection, in a single cycle and with reasonable population sizes, is very low. To overcome these problems, Charmet et al. (1999) proposed iterating the selection / recrossing process between RIL until the ideotype is obtained. In both cases, however, no selection is carried out within the progeny of the crosses between RIL, the new lines typically being obtained by SSD (Single Seed Descent) or by doubled haploid production. Selecting directly within the progeny obtained after crossing could progressively increase the probability of obtaining the ideal genotype over the generations. This is the approach of the method proposed by Hospital et al. (2000), called MBRS (Marker Based Recurrent Selection).

The method proposed by Hospital et al. (2000) assumes a random-mating population in which individuals are selected solely on their marker genotype. To obtain their results, they assumed a starting population in linkage equilibrium for the QTL to be pyramided. Their method is nevertheless applicable to any starting population, for example a RIL population. From this starting population, some individuals are selected as breeders and then intercrossed at random. In the following generations the process is iterated, the breeders being selected using a QTL complementation strategy (QCS). The first step of selection under QCS is to compute a marker score for every individual in the population. This marker score is computed from the individual's genotype at the markers flanking the QTL.
Individuals could be selected on this value alone, but this leads to the loss of favorable alleles at some QTL (Hospital et al., 2000). The QCS strategy aims to avoid this loss by ensuring that at least nT favorable alleles at each QTL are present in the breeder population, the number of favorable alleles carried by an individual being determined from its genotype at the markers flanking the QTL. Breeders are chosen by determining the smallest subset of individuals whose marker score is maximal while ensuring that at least nT favorable alleles are present at each QTL. The minimum number of breeders, that is, the smallest possible size of the subset, is a parameter of the QCS strategy called N0. The outcome of QCS depends on the values of N0 and nT. In their conclusion, Hospital et al. (2000) showed that nT = 3 was effective in maintaining a fairly high rate of fixation of alleles at the QTL while minimizing the risk of losing these QTL. The minimum number of breeders (N0) has less influence on the efficiency of QCS and can be fairly small while still ensuring sufficient genetic gain. Hospital et al. (2000) suggest N0 = 3. QCS fixes, in about ten generations, the favorable alleles at the markers flanking 50 QTL by selecting 3 to 5 individuals in a population of 200. However, when the markers are not located directly on the QTL, marker-QTL associations can be lost through double recombinations occurring between the markers. The efficiency of marker-based selection is then reduced, and the frequency of favorable alleles at the QTL is only 92% in the population.

In the MBRS method, selection is applied at every generation of reproduction to progressively increase allele frequencies in the population, unlike the methods of van Berloo and Stam (1998) and Charmet et al. (1999), which proposed recrossing cycles in which selection is applied only once fully homozygous progeny have been obtained. The method of Hospital et al. (2000) is therefore more flexible than the methods presented above and can handle more QTL.

In the MBRS method, crosses between breeders are made at random (panmixia). Some of these crosses are useful and produce individuals with better combinations of QTL alleles than their parents. Other crosses, in contrast, are useless and presumably reduce the efficiency of selection. To improve the efficiency of QTL pyramiding, methods must be developed to determine the best crosses between individuals. It would then be possible to determine the optimal pyramiding method, that is, the best succession of crosses between individuals that combines all the QTL. To identify it, one must first be able to explore the space of possible solutions. All possible solutions can then be evaluated and the best one found; for example, one can determine the succession of crosses that yields the ideotype combining all the genes of a given population while minimizing the total population sizes to be handled during the breeding program. The article by Servin et al., included in this thesis, presents a theoretical framework for exploring all these solutions and an example of a search for the crossing plan that minimizes total population sizes.
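The breeder-selection step of the QCS strategy described earlier (the smallest subset, of size at least N0, with maximal marker score and at least nT favorable alleles at every QTL) can be sketched as a brute-force search. This is only one possible reading of that description, not the procedure of Hospital et al. (2000): the marker score is assumed here to be the unweighted count of favorable alleles, and the candidate names and genotypes are hypothetical.

```python
from itertools import combinations


def qcs_select(candidates, n_t, n0):
    """Pick breeders under a simplified QCS rule.

    candidates: dict mapping an individual's name to a list giving its
    number of favorable alleles (0, 1 or 2) at each QTL.
    Returns the first (smallest) subset of size >= n0 that carries at least
    n_t favorable alleles at every QTL, choosing the highest total score
    among subsets of that size, or None if no subset qualifies.
    """
    names = sorted(candidates)
    n_qtl = len(next(iter(candidates.values())))
    for size in range(n0, len(names) + 1):
        best = None
        for subset in combinations(names, size):
            per_qtl = [sum(candidates[n][q] for n in subset) for q in range(n_qtl)]
            if min(per_qtl) < n_t:
                continue  # some QTL would be under-represented among breeders
            score = sum(per_qtl)  # assumed marker score: favorable-allele count
            if best is None or score > best[0]:
                best = (score, subset)
        if best is not None:
            return best[1]
    return None


# Hypothetical candidates scored at three QTL.
candidates = {
    "ind1": [2, 2, 0],
    "ind2": [2, 1, 0],
    "ind3": [0, 0, 2],
    "ind4": [1, 1, 1],
}
print(qcs_select(candidates, n_t=2, n0=2))
```

In this toy case, selecting the two highest-scoring individuals alone (ind1 with ind2 or ind4) would leave fewer than two favorable alleles at the third QTL, whereas the constrained search retains ind3. Exhaustive enumeration is only practical for small candidate sets.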
This approach is very promising for optimizing genotype building programs. However, the method presented in Servin et al. is limited by the number of genes it can handle. Enumerating all pyramiding programs is limited by the total number of possible solutions, which grows exponentially with the number of QTL to be pyramided: exhaustive enumeration of crossing plans designed to pyramid more than 9 genes is not feasible because of the very high number of possible solutions. To handle more genes, one solution is to run several pyramiding programs in parallel and to combine the resulting subsets of genes in a second step. Another solution is to extend the existing theory by optimizing the search for the best breeding program, that is, by avoiding the enumeration of all possible solutions without compromising the discovery of the best one. Theoretical developments therefore remain to be carried out, starting from the framework presented in Servin et al.

Conclusion and Perspectives

Developing efficient genotype building methods limits the costs of introgressing genes into homogeneous genetic backgrounds (marker-assisted backcrossing) and / or of pyramiding genes in a single genotype. The genotypes produced by genotype building may be of agronomic interest; in that case, genotype building is a method of variety development. These genotypes can also be produced to study how genes are expressed depending on the genetic background in which they are found and / or how genes interact with one another (that is, to determine the epistatic relationships between genes). Developing genotype building methods is therefore important for exploiting the results of QTL detection experiments and / or of the identification of the underlying genes (see in particular the articles by Lecomte et al. (submitted) and Thabuis et al. (submitted) included in this thesis).

The development of genotype building methods is based on computing the probabilities of allele transmission in a controlled cross. To optimize crosses between individuals, one must know the probabilities that two parents of known genotypes transmit their genes to their offspring. Genotype building therefore belongs more to Mendelian genetics than to quantitative genetics. Gene transmission probabilities are easy to compute when one or two genes are considered or when the genes are not genetically linked. However, developing general genotype building methods requires computing the transmission probabilities of many linked genes. The MDM program (Servin et al. 2002), developed during my DEA and then my thesis, performs these computations in complex crossing configurations and was therefore a tool used in several of my studies. The studies on genotype building that existed before my thesis were essentially devoted to optimizing marker-assisted backcrossing.
Indeed, this breeding method is a typical case of genotype building, one that existed before QTL detection and the development of molecular markers. However, marker-based selection very significantly improves the results of backcross programs. I therefore naturally worked on marker-assisted backcrossing in order to study its optimization. Most work on marker-assisted backcrossing considers the optimization of certain selection objectives taken separately. I was personally interested in estimating the genetic composition of individuals produced by backcrossing using the information provided by markers (Servin and Hospital (2002); Servin, in prep.). In the part of this thesis devoted to the principles of optimizing marker-assisted backcrossing, I drew on the results of these different studies to describe a global optimization approach. This document shows that marker-assisted backcrossing is now a genotype building method that can be fully optimized.

However, genotype building is not limited to marker-assisted backcrossing. Backcrossing is not an efficient breeding method for pyramiding many genes. We have seen in this introduction the theoretical methods developed to pyramid several QTL in a single genotype. The drawback of these methods is that they do not allow crosses between individuals to be fully directed. This is a complex problem, and studying it required building a new framework for modeling directed crosses between individuals. We worked on building such a framework (Servin et al., submitted) and used it to determine the best strategy for pyramiding 8 linked genes in a single genotype. This study shows that fully directing the crosses between individuals reduces both the time and the cost needed to pyramid these 8 genes compared with the MBRS strategy (Hospital et al. 2000) described in this introduction.

This theoretical framework must now be extended. Two main lines of research can be considered to improve it. First, the exhaustive search for the best pyramiding strategy should be optimized by improving the algorithms used in the method. This will increase the number of genes for which an optimal result can be found. Second, the results obtained by exhaustively enumerating all possible strategies should be used to identify the general rules of gene pyramiding optimization that explain them. These two lines of research are linked: if general optimization rules are known, it then becomes possible to write much more efficient search algorithms (Servin et al. a, in prep.).

References

Charmet, G., N. Robert, M. Perretant, G. Gay, P. Sourdille, et al., 1999 Marker-assisted recurrent selection for cumulating additive and interactive QTLs in recombinant inbred lines. Theoretical and Applied Genetics 99: 1143–1148.
Crepieux, S., B. Servin, C. Lebreton, and G. Charmet, IBD-based QTL detection in multi-cross inbred-design: A case study of cereal breeding program. Submitted to Genetics.
Gallais, A., 1990 Théorie de la sélection en amélioration des plantes. Masson.
George, A., P. Visscher, and C. Haley, 2000 Mapping quantitative trait loci in complex pedigrees: a two-step variance component approach. Genetics 156(4): 2081–92.
Gimelfarb, A. and R. Lande, 1994 Simulation of marker assisted selection in hybrid populations. Genet Res 63(1): 39–47.
Haley, C. S. and S. A. Knott, 1992 A simple regression method for mapping quantitative trait loci in line crosses using flanking markers. Heredity 315–324.
Hayes, B., P. Visscher, H. McPartlan, and M. Goddard, 2003 Novel multilocus measure of linkage disequilibrium to estimate past effective population size. Genome Res 13(4): 635–43.
Hoeschele, I., P. Uimari, F. Grignola, Q. Zhang, and K. Gage, 1997 Advances in statistical methods to map quantitative trait loci in outbred populations. Genetics 147(3): 1445–57.
Hospital, F., I. Goldringer, and S. Openshaw, 2000 Efficient marker-based recurrent selection for multiple quantitative trait loci. Genetical Research 75: 357–368.
Hospital, F., L. Moreau, F. Lacoudre, A. Charcosset, and A. Gallais, 1997 More on the efficiency of marker assisted selection. Theoretical and Applied Genetics 95: 1181–1189.
Jansen, R., 1993 Interval mapping of multiple quantitative trait loci. Genetics 135(1): 205–11.
Jansen, R., J.-L. Jannink, and W. Beavis, 2003 Mapping quantitative trait loci in plant breeding populations: use of parental haplotype sharing. Crop Science 43: 829–834.
Kao, C., Z. Zeng, and R. Teasdale, 1999 Multiple interval mapping for quantitative trait loci. Genetics 152(3): 1203–16.
Lande, R. and R. Thompson, 1990 Efficiency of marker-assisted selection in the improvement of quantitative traits. Genetics 124(3): 743–56.
Lander, E. and D. Botstein, 1989 Mapping mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121(1): 185–99.
Lange, C. and J. Whittaker, 2001 On prediction of genetic values in marker-assisted selection. Genetics 159(3): 1375–81.
Meuwissen, T. and M. Goddard, 2001 Prediction of identity by descent probabilities from marker-haplotypes. Genet Sel Evol 33(6): 605–34.
Meuwissen, T., B. Hayes, and M. Goddard, 2001 Prediction of total genetic value using genome-wide dense marker maps. Genetics 157(4): 1819–29.
Meuwissen, T., A. Karlsen, S. Lien, I. Olsaker, and M. Goddard, 2002 Fine mapping of a quantitative trait locus for twinning rate using combined linkage and linkage disequilibrium mapping. Genetics 161(1): 373–9.
Moreau, L., A. Charcosset, F. Hospital, and A. Gallais, 1998 Marker-assisted selection efficiency in populations of finite size. Genetics 148(3): 1353–65.
Sax, K., 1923 The association of size difference with seed-coat pattern and pigmentation in Phaseolus vulgaris. Genetics 8: 552–560.
Servin, B., Optimal background selection strategy to fullfill selection objectives in marker-assisted backcrossing. In prep.
Servin, B., C. Dillmann, G. Decoux, and F. Hospital, 2002 Mdm: a program to compute fully informative genotype frequencies in complex breeding schemes. Journal of Heredity 93(3): 227–228.
Servin, B. and F. Hospital, 2002 Optimal positioning of markers to control genetic background in marker-assisted backcrossing. Journal of Heredity 93(3): 214–217.
Servin, B., O. C. Martin, M. Mézard, and F. Hospital, a An optimal algorithm for the optimization of gene cascading. In prep.
Servin, B., O. C. Martin, M. Mézard, and F. Hospital, b Towards a theory of marker assisted gene pyramiding. Submitted to Genetics.
Terwilliger, J. and K. Weiss, 1998 Linkage disequilibrium mapping of complex disease: fantasy or reality? Curr Opin Biotechnol 9(6): 578–94.
van Berloo, R. and P. Stam, 1998 Marker-assisted selection in autogamous RIL populations: a simulation study. Theoretical and Applied Genetics 96: 147–154.
Weiss, K. and J. Terwilliger, 2000 How many diseases does it take to map a gene with SNPs? Nat Genet 26(2): 151–7.
Whittaker, J., R. Curnow, C. Haley, and R. Thompson, 1995 Using marker-maps in marker-assisted selection. Genet Res 66(3): 255–65.
Whittaker, J., C. Haley, and R. Thompson, 1997 Optimal weighting of information in marker-assisted selection. Genetical Research 69: 137–144.
Xie, C., D. Gessler, and S. Xu, 1998 Combining different line crosses for mapping quantitative trait loci using the identical by descent-based variance component method. Genetics 149(2): 1139–46.
Zeng, Z., 1994 Precision mapping of quantitative trait loci. Genetics 136(4): 1457–1468.


Part Two

Computing multilocus genotype frequencies in complex pedigrees


Presentation

The MDM and GRAFGEN computer programs

Analyzing genotype building schemes, or segregating populations produced to detect QTL, requires computing the probabilities that parents transmit to their offspring a gamete of a given (or desired) genotype at mapped loci. These probabilities can be derived formally (the equations for computing them can be written down) when (i) the number of loci considered is small (typically two), (ii) the genotypes of the parents are perfectly known and (iii) a single mating system is used to obtain the offspring from the parents. For example, most QTL detection programs are suited only to the analysis of relatively simple crossing designs, typically F2, BC1 or recombinant inbred line populations. In practice, however, more complex crossing designs are encountered, for example populations of highly recombinant inbred lines, or backcrosses followed by generations of selfing to fix the introgressed genes in the homozygous state. In some crossing designs, such as marker-assisted backcrossing, the populations are genotyped at every generation. The succession of genotypes of the individuals throughout the crossing design is therefore known. These genotypes provide information on the recombination events that took place during the crossing design and allow allele transmission probabilities to be computed more precisely. However, because the analytical derivations needed to take this kind of information into account are very tedious, the information on the genotypes of ancestors in previous generations is generally not used.
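To make the notion of gamete transmission probabilities at mapped loci concrete, here is a minimal Python sketch for a single meiosis of one parent with known phase. It is not the MDM algorithm; it assumes no crossover interference (Haldane's mapping function), and the haplotypes and map positions are hypothetical.

```python
import math
from itertools import product


def haldane(distance_cm):
    """Recombination fraction for a map distance in centimorgans (no interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


def gamete_probabilities(hap1, hap2, positions_cm):
    """Probability of each gamete produced by a parent with haplotypes hap1/hap2.

    hap1, hap2: strings with one allele symbol per locus, in map order.
    positions_cm: map positions of the loci, in centimorgans, increasing.
    """
    n = len(positions_cm)
    r = [haldane(positions_cm[i + 1] - positions_cm[i]) for i in range(n - 1)]
    probs = {}
    # 'origin' records, locus by locus, which parental haplotype was copied.
    for origin in product((0, 1), repeat=n):
        p = 0.5  # either haplotype is equally likely at the first locus
        for i in range(n - 1):
            p *= r[i] if origin[i] != origin[i + 1] else 1.0 - r[i]
        gamete = "".join((hap1, hap2)[o][i] for i, o in enumerate(origin))
        probs[gamete] = probs.get(gamete, 0.0) + p
    return probs


# Hypothetical triple heterozygote with loci at 0, 10 and 20 cM.
for gamete, p in sorted(gamete_probabilities("AAA", "aaa", [0, 10, 20]).items()):
    print(gamete, round(p, 4))
```

The number of origin vectors doubles with each locus, which hints at why closed-form derivations quickly become impractical for many linked loci and several generations.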
To take into account all the marker information available in complex pedigrees, we developed a numerical computation program: MDM (Servin et al., 2002). MDM uses a very large share of the information provided by genotyping the individuals of the crossing design, while accommodating partially or entirely unknown genotypes. One of the main applications of MDM is to estimate with high precision the genetic composition of an individual over its whole genome, based on its genotype at markers and those of its ancestors. This information can be used to compute summary measures of an individual's value in selection (for example, in a backcross program, its estimated proportion of recurrent parent genome). It also allows precise estimates of an individual's genotype in regions of its genome that are of particular interest. To make the results of MDM computations easy to analyze, it is convenient to use a graphical representation of an individual's genotype in these regions, which is why we developed the GRAFGEN program (Servin and Hospital, submitted).

Application examples

QTL detection: computing identity-by-descent probabilities (Crépieux et al., submitted)

To test for the presence of a QTL at any point of the genome, one needs populations of individuals whose phenotypes are known for the traits of interest, and one must relate the phenotypic value to the genotypes of the individuals at the markers and between the markers. To avoid having to produce populations specifically for QTL detection, it would be useful to use populations that are already available, for example populations derived from phenotypic selection schemes. Because these populations are made up of individuals derived from a selection scheme, they are generally small. To obtain sufficient detection power, one must then consider not a single population but a set of interconnected populations. In this case it is useful to compute identity-by-descent probabilities between individuals at the position tested for the presence of a QTL and then to carry out a statistical analysis by maximum likelihood in mixed models. Crépieux et al. (available in the Appendix) used the MDM program to compute allele transmission probabilities within recombinant inbred line populations. Using these probabilities and a measure of the kinship between the founder parents of each population, it is then possible to determine precisely the probability that any two individuals of the design are identical by descent at a given point of the genome. This ultimately allows QTL to be detected while making the best use of the information available over the whole design.

Analysis of populations produced by marker-assisted backcrossing

The articles by Thabuis et al. (submitted) and Lecomte et al. (submitted), available in the Appendix, present two marker-assisted QTL introgression programs. The GRAFGEN program was used in these studies to estimate precisely the proportion of recurrent parent genome in the individuals obtained at the end of the selection process. GRAFGEN was also used to produce the graphical genotypes of these individuals, showing precisely the distribution of the donor alleles still present in their genetic backgrounds.

Optimal positioning of markers to control the genetic background in marker-assisted backcross programs

The MDM program was used to carry out the theoretical computations presented in the articles Servin and Hospital (2002) and Servin (submitted), included in the third part of this thesis. These computations determine the optimal marker positions for controlling the genetic background in marker-assisted backcross programs. The details of these computations can be found in the third part of this thesis.

Perspectives

The GRAFGEN and MDM programs are beginning to be used by the plant breeding research community. According to the user feedback known to me, these programs seem to answer relevant scientific questions in an original way. Apart from the examples of use presented in this thesis, these programs have been used, to my knowledge, to
– easily identify recombinant individuals around genes of interest in populations produced by backcrossing (Marie Coque, INRA Ferme du Moulon);
– estimate the proportion of recurrent parent genome in populations produced by backcrossing (Agnès Bouchez, INRA Mons);
– visualize regions of segregation distortion in genetic mapping populations (Fabien Chardon, INRA Ferme du Moulon);
– visualize the regions of the genome that have been affected by phenotypic selection (Marie Foulongne, INRA Bordeaux).

At the end of my thesis, I think the development of these programs can be considered complete. The methodology implemented in MDM (mainly based on the theoretical developments of Hospital et al., 1996) could be used in a wider field than those covered by MDM and GRAFGEN. For example, this methodology could be adapted to allow the analysis of even more general crossing schemes. In particular, it should be possible to handle crossing schemes
– in which the founder parents are not homozygous individuals;
– in which crosses between pairs of individuals with partially known genotypes are used;
– in which crosses are made between individuals of different generations (pedigrees containing loops).

This would make it possible to adapt the MDM methodology to more complex pedigrees and to disseminate it more widely, in particular within the animal genetics community. In the breeding schemes of the major animal species, the situations listed above are frequently encountered. These additional developments nevertheless require new computer programs, since the architecture of the MDM program is not well suited to them.


© 2002 The American Genetic Association

Received March 21, 2001 Accepted January 30, 2002 Corresponding Editor: Leif Andersson

MDM: A Program to Compute Fully Informative Genotype Frequencies in Complex Breeding Schemes

B. Servin, C. Dillmann, G. Decoux, and F. Hospital

In many genetics studies it is necessary to compute the expected frequencies of genotypes at marker loci and/or to infer the genotypes at chromosomal locations from the known genotypes at markers. This is the case, for example, for quantitative trait loci (QTL) detection, where likelihood ratio tests or multiple regressions are based on the probabilities of the different genotypes at a putative QTL location, given the genotypes at flanking markers. This is also the case for "graphical genotypes" (Young and Tanksley 1989), where the aim is to estimate the genomic composition of the chromosomes (parental origin of the alleles) given the genotypes at markers. The calculations performed by most existing programs (e.g., for QTL detection) are based solely on the genotypes at the two closest markers flanking the putative position on each side, observed at only one generation. This is sufficient only if the population considered derives from a single generation of effective recombination (e.g., BC1, F2) and if marker genotypes are known without ambiguity. If there is ambiguity in marker genotypes (e.g., dominance, missing data) or if more than one effective meiosis has taken place (e.g., F3, recombinant inbred lines (RILs), advanced backcross generation), then additional markers further from the putative position and/or marker genotypes at previous generations may also be informative and could permit more accurate prediction of the genotype at the putative position. Also, available programs generally consider fixed and reasonably simple breeding schemes (e.g., F2, F3, RIL, BCn), whereas breeding schemes of higher (and arbitrary) complexity are more and more often used in practice (e.g., random mating before selfing to produce highly recombinant inbred lines (HRILs) with higher apparent recombination rate, BC followed by selfing to fix the introgressions, etc.).

MDM is a program that computes frequencies of multilocus genotypes in populations derived from breeding schemes involving any combination of selfing, full-sib mating, random mating, backcrossing, or hybrid mating, and that takes into account all the genotypic information available (flanking and nonflanking markers, intermediate generations). It can be used interactively to perform the relevant calculations on experimental data, or it can be included as a function in QTL detection programs or in simulation programs aimed at optimizing breeding schemes before proceeding to the experiments. More generally, MDM was designed for fast and easy numerical computation of multilocus genotype frequencies in arbitrary breeding schemes, avoiding cumbersome analytic derivations.
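The simple case handled by most existing programs, a single effective meiosis with unambiguous flanking markers, can be written out in a few lines. The sketch below is an illustration for a BC1 population and is not taken from MDM; it assumes no crossover interference, and the function name and argument names are hypothetical.

```python
def qtl_het_probability(left_is_het, right_is_het, r1, r2):
    """P(putative QTL is heterozygous | flanking marker genotypes) in a BC1.

    left_is_het, right_is_het: True if the flanking marker is heterozygous
    (donor allele transmitted by the F1), False if homozygous recurrent.
    r1: recombination fraction between the left marker and the QTL.
    r2: recombination fraction between the QTL and the right marker.
    """
    def step(same_state, r):
        # Probability of keeping (1 - r) or switching (r) state across an interval.
        return 1.0 - r if same_state else r

    # Joint probabilities of the marker data with each QTL state; the common
    # prior of 1/2 for the QTL state cancels in the ratio.
    p_het = step(left_is_het, r1) * step(right_is_het, r2)
    p_hom = step(not left_is_het, r1) * step(not right_is_het, r2)
    return p_het / (p_het + p_hom)


print(qtl_het_probability(True, True, 0.1, 0.1))
print(qtl_het_probability(True, False, 0.1, 0.1))
```

With both flanking markers heterozygous and r1 = r2 = 0.1, the QTL is heterozygous unless a double recombination occurred; with recombinant flanking markers at equal distances, the two QTL states are equally likely.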

Principle

The program works with a collection of loci (typically marker loci) described by their positions on a genetic map. Given a pedigree, the program computes the probabilities of the offspring genotypes at generation n. The pedigree is defined by the genotypes of the ancestors at each former generation (from generation 1 to n − 1), and by the mating systems used between generations. The breeding scheme (i.e., the succession of mating systems) can be any combination of backcrossing, hybrid mating, full-sib mating, or selfing. Depending on the mating system, one or two ancestor genotypes are needed at each generation. A single ancestor (herein called the maternal ancestor) is needed for selfing or full-sib mating. A second ancestor (herein called the paternal ancestor) is needed for hybrid mating or backcrossing.

In practice, the genotyping of an individual produces an observation (i.e., "phenotype") that often reflects its true genotype only partially. Indeed, marker phenotypes usually do not provide the gametic phase of the chromosomes (which allele originates from which parent), for example, in the case of a double or multiple heterozygote. Furthermore, genotyping data may not be fully informative, because of missing or incomplete data (e.g., in the case of dominant markers). So the program distinguishes between "observed genotypes" (OGs), allowing missing or incomplete genotyping data, and "true genotypes" (TGs), where all alleles at all loci as well as the gametic phase are assumed to be known. Individuals are described by their OGs at all loci. The coding of the OGs and the relationship between OGs and TGs is user defined, allowing the user to work with any genotype-coding system used in particular experiments.

According to the coding system, OGs at each generation are converted into all possible sets of corresponding TGs. Then, the probabilities of transition between all possible sets of TGs at different generations are computed according to the recursion equations of Hospital et al. (1996). Finally, these probabilities are summed to provide the probability that each offspring genotype at generation n derives from the ancestors in previous generations given the breeding scheme.

An additional locus (typically a putative QTL position, or a point on a chromosome) can be included in the calculations. The program then computes two sets of genotypic frequencies: the frequencies of OGs at marker loci only, and the frequencies of OGs at marker loci plus the additional locus. The ratio of the two gives the conditional probabilities of putative genotypes at the additional locus given the observed genotypes at marker loci. The maximal number of loci that can be considered simultaneously depends on computer memory size (e.g., taking seven loci into account requires 64 MB of RAM). The number of ancestor generations is unbounded, but affects computing time (in conjunction with the number of loci).
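The ratio of the two sets of frequencies can be illustrated on the simplest possible case. The sketch below is not MDM's algorithm or input format: it assumes a single BC1 generation, two fully informative flanking markers, Haldane's mapping function without interference, and hypothetical function names (`haldane`, `bc1_joint`, `conditional_q`).

```python
import math
from itertools import product

def haldane(d_morgan):
    """Recombination rate for a map distance in Morgans (Haldane, no interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * abs(d_morgan)))

def bc1_joint(m1, q, m2, d1, d2):
    """Frequency of the gamete (m1, q, m2) transmitted by an F1 parent.

    Alleles are coded 1 (donor) or 0 (recipient); d1 and d2 are the
    M1-Q and Q-M2 distances in Morgans (illustrative assumption).
    """
    r1, r2 = haldane(d1), haldane(d2)
    p = 0.5                                # first locus: either parental strand
    p *= r1 if m1 != q else 1.0 - r1       # recombination between M1 and Q
    p *= r2 if q != m2 else 1.0 - r2       # recombination between Q and M2
    return p

def conditional_q(m1, m2, d1, d2):
    """P(Q carries the donor allele | flanking marker alleles m1 and m2)."""
    # Frequency with the additional locus fixed to the donor allele ...
    with_locus = bc1_joint(m1, 1, m2, d1, d2)
    # ... divided by the frequency at the marker loci only.
    markers_only = sum(bc1_joint(m1, q, m2, d1, d2) for q in (0, 1))
    return with_locus / markers_only

if __name__ == "__main__":
    for m1, m2 in product((0, 1), repeat=2):
        print(m1, m2, round(conditional_q(m1, m2, 0.05, 0.10), 4))
```

With more generations, ambiguous marker data, or non-flanking markers, this closed form no longer applies, which is the situation the recursion of Hospital et al. (1996) is meant to handle.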

The Journal of Heredity 2002:93(3), Computer Notes, pp. 227–228

Running the Program

A plain-text (ASCII) file sets the values of the parameters used for the computation: number of generations of the breeding scheme, number of offspring, number of genotypes to allocate to the additional locus in the offspring, and name of the file containing the coding system. It also contains the mating systems used in each generation. Finally, it contains the genetic map (chromosomes, names and positions of markers) along with the genotypes of the ancestors and the offspring at the marker loci.

The genetic map used by MDM is constant during the whole breeding scheme. It is therefore assumed that recombination rates between loci are evaluated once (either on the population studied or on another one, e.g., if using a consensus or a joint map) and that they are not reestimated during the breeding scheme. It is possible to compose a large marker dataset only once, then run the program with different subsets of these markers on a given chromosome. This is particularly useful if the total number of loci on the chromosome is larger than MDM can handle.

The computations performed by MDM can be customized by using options, such as including an additional locus in the computation. MDM can be run several times for different positions of an additional locus. This can be used to perform a chromosome scan of the different offspring or to analyze a particular chromosome segment of interest (e.g., containing a QTL). In this case, it is possible to obtain the conditional probabilities of several genotypes at the additional locus, given the observed genotype at markers for each offspring.

The output can be either detailed, including a recall of the input parameters, or brief. The brief option makes it easier for other programs to use the results provided by MDM. Another way to make MDM interact with other programs is to use its core computation function as a subroutine. For this purpose, the source code of MDM is split into two files, one containing the core computation function and one containing the functions that manage input and output.

Package

MDM is written in ANSI C and has been developed under a Linux/UNIX environment using the GNU C compiler (gcc). This compiler is included in all Linux and UNIX distributions. It is also freely available for Windows and DOS environments (www.delorie.com/djgpp). It is therefore easy to compile MDM and to use it under other environments (e.g., Windows 9x and NT or DOS). The MDM package contains the source code and binaries of the program (including a Windows executable file) along with a user's manual including the rules to write input files and examples. The package can be obtained free of charge by sending a blank DOS-formatted floppy disk to the corresponding author. The files are also freely available for downloading at http://moulon.inra.fr/~servin/mdm.

From the Station de Génétique Végétale, INRA/UPS/INAPG, Ferme du Moulon, 91190 Gif sur Yvette, France. Address correspondence to Bertrand Servin at the address above or e-mail: [email protected].

© 2002 The American Genetic Association

References

Hospital F, Dillmann C, and Melchinger AE, 1996. A general algorithm to compute multilocus genotype frequencies under various mating systems. Comput Appl Biosci 12:455–462.

Young ND and Tanksley SD, 1989. Restriction fragment length polymorphism maps and the concept of graphical genotypes. Theor Appl Genet 77:95–101.

Received February 1, 2001
Accepted December 31, 2001
Corresponding Editor: Robert Angus


GRAFGEN

October 23, 2003

GRAFGEN: A program to design precision graphical genotypes

Bertrand Servin and Frédéric Hospital

Station de Génétique Végétale, INRA / CNRS / UPS / INAP-G, 91190 Gif-sur-Yvette, France

ABSTRACT

Summary: GRAFGEN is a tool for the analysis of complex breeding schemes with molecular markers. It produces numerical output (probabilities of allelic transmission through complex pedigrees, giving detailed knowledge of the genomic composition of a population) and various graphical representations of the results. In particular, GRAFGEN designs precision graphical genotypes, which extend the concept of graphical genotypes by interpolating the genotypes between markers while taking into account all possible recombinations given the marker map, the observed genotypes, and the pedigree data.

Availability: GRAFGEN is free software available at http://moulon.inra.fr/~servin/grafgen.

Contact: [email protected]

In most plants and animals, the genotypes of individuals at molecular marker loci located on genetic maps can be assessed, leading to a discrete knowledge of the genomic composition of these individuals. Young and Tanksley (1989) introduced the concept of graphical genotypes in order to estimate and visualize the genomic composition of individuals between markers. Graphical genotypes are a helpful tool, for example, to screen populations for individuals carrying desired genotypes at genomic regions of interest, and/or to spot favorable recombination events in marker-assisted selection programs.

However, the method proposed by Young and Tanksley to build graphical genotypes is an approximation and does not use all the information available. First, for schemes lasting more than one generation, the accumulation of crossovers over time cannot always be ignored. Taking the number of meioses into account gives a better estimate of the genomic composition of individuals. Second, when available, the genotypes of the ancestors of the studied individuals in a pedigree can help to determine the most likely set of recombination events that led to the observed data. This set is not necessarily the one implying the fewest crossovers. Third, taking into account the genotypes at more than two markers flanking the region of interest in a multilocus analysis can help to assess the set of recombination events more precisely. Finally, breeding schemes more complex than F2 or backcrosses are commonly used nowadays, such as backcrossing followed by selfing to fix introgressions, or alternation of random mating and selfing to produce Highly Recombinant Inbred Lines. The combination of mating systems used to produce the population strongly affects the rules of allelic transmission from parents to their offspring, and hence must not be ignored.
GRAFGEN was written to take all this information into account when estimating the genomic composition of individuals, and to draw precision graphical genotypes accordingly. The principle is to compute the frequencies of all possible genotypes at equally spaced points (virtual loci) on the genome, given all pedigree information. From the results of these computations, GRAFGEN produces the precision graphical genotype of an individual. Different representations are possible (see Figure 1). The computational basis of GRAFGEN is the analytic equations derived by Hospital et al. (1996). These equations are implemented in the MDM program (Servin et al., 2002) for numerical computations. These computations make it possible to estimate the genomic composition of an individual given (if available): i) its genotype at markers, ii) the genotypes of its ancestors, and iii) the breeding scheme from which it is derived. This breeding scheme is any combination of mating systems (hybrid mating, selfing, full-sib mating, random mating or doubled haploids). Marker genotypes can be either completely known (including coupling/repulsion phase), or partially known (e.g. for dominant markers), or completely unknown (missing data). Hence, the precision graphical genotypes produced by GRAFGEN take into account all recombination configurations consistent with the pedigree data while weighting them according to their probabilities of occurrence.

Figure 1: Possible representations of Precision Graphical Genotypes. Example of an F3 population with two alleles segregating (noted 0 and 1). GRAFGEN represents for each individual either: (a) the probability of being of a given genotype (here the homozygote 1/1), or (b) the expected dose of a particular allele (here 1), or (c) the zones where the probabilities of given genotypes exceed a given threshold (here, the zones of probability > 0.8 are green for the heterozygote 0/1, red for the homozygote 1/1, and blue for the homozygote 0/0); GRAFGEN can also represent a synthetic "genotype" for the whole population, according to the mean allele frequency in the population (d).

GRAFGEN is a program that produces image files on output (in JPEG or PNG format). It also produces a simple text file containing the results of its computations. GRAFGEN is written in ANSI C using the GD Library (http://www.boutell.com/gd). The source code and binaries for Linux and Windows are provided (see Availability). GRAFGEN is free software published under the GNU General Public License (GPL).

References

Hospital F, Dillmann C, and Melchinger AE, 1996. A general algorithm to compute multilocus genotype frequencies under various mating systems. Comput Appl Biosci 12:455–462.

Servin B, Dillmann C, Decoux G, and Hospital F, 2002. MDM: a program to compute fully informative genotype frequencies in complex breeding schemes. J Hered 93(3):227–228.

Young ND and Tanksley SD, 1989. Restriction fragment length polymorphism maps and the concept of graphical genotypes. Theor Appl Genet 77:95–101.
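Representations (b) and (c) of Figure 1 are simple functions of the genotype probabilities at each virtual locus. The sketch below only illustrates that post-processing step; it is not GRAFGEN's code, the function names are hypothetical, and the probability profile is made up for the example rather than taken from real data.

```python
def expected_dose(p00, p01, p11):
    """Expected number of copies of allele 1 at one virtual locus (representation b)."""
    return p01 + 2.0 * p11

def threshold_zone(p00, p01, p11, threshold=0.8):
    """Genotype whose probability exceeds the threshold, or None (representation c)."""
    for label, p in (("0/0", p00), ("0/1", p01), ("1/1", p11)):
        if p > threshold:
            return label
    return None

# Colours as described in the caption of Figure 1.
ZONE_COLOURS = {"0/1": "green", "1/1": "red", "0/0": "blue", None: "uncoloured"}

# Hypothetical (P(0/0), P(0/1), P(1/1)) at five equally spaced virtual loci.
profile = [
    (0.02, 0.08, 0.90),
    (0.05, 0.25, 0.70),
    (0.10, 0.60, 0.30),
    (0.05, 0.90, 0.05),
    (0.85, 0.14, 0.01),
]

if __name__ == "__main__":
    for probs in profile:
        zone = threshold_zone(*probs)
        print(round(expected_dose(*probs), 2), zone, ZONE_COLOURS[zone])
```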


Part Three
Principles for Optimizing Marker-Assisted Backcrossing

Presentation

This part of my thesis consists of a synthesis document on the principles for optimizing marker-assisted backcrossing and of two articles (Servin and Hospital, 2002; Servin, submitted) on the optimization of selection on the genetic background. The document "Marker-assisted backcrossing: optimization principles" included in this part was written to bring together the main theoretical results on the subject (including the results presented in the two articles included in this part) and to present, through simulations, examples of their application. It was not possible to cover every situation in this document, but I believe it conveys the approach that should be followed when optimizing a marker-assisted backcross program. This approach is illustrated through the optimization of two case studies representative of two classic introgression situations. In the first case, a gene is introgressed with a moderate target for the rate of return to the recurrent parent (cultivated type); in the other, with a high target for the rate of return to the recurrent parent (wild type).

Theoretical developments on the optimization of marker-assisted backcrossing have addressed all of its aspects (retention of the introgressed genes, reduction of the size of the fragment dragged around the gene, and acceleration of the return to a recipient genetic background). The latest theoretical developments are relatively recent (in particular those on reducing the segment dragged around the gene and on optimizing the number and position of the genetic background markers). The full set of theoretical results presented in this part has therefore not yet been applied in its entirety in marker-assisted backcross programs. The two articles by Thabuis et al. (submitted) and Lecomte et al. (submitted), presented as appendices to this thesis, are two interesting examples of optimization of marker-assisted backcrossing using the theoretical results available at the time they were designed. They mainly concern the optimization of the markers used to control the confidence intervals of the introgressed QTL and the calculation of the minimum population sizes needed to introgress these intervals. Original marking strategies to reduce the genotyping costs of the introgression program were also implemented in these examples. The genotype-building results in these articles are nevertheless not perfect, in particular regarding the reduction of the size of the introgressed segments around the QTL. Interpreting the phenotypic values of the individuals derived from these programs is therefore difficult, because the genes located in the donor genome regions corresponding to these segments certainly influence those phenotypic values.

Perspectives

Now that all the principles for optimizing marker-assisted backcrossing are known, tools must be developed that make the best use of the theoretical results by disseminating them to the breeding community. Applying the theoretical results to practical introgression programs requires expertise that synthesis documents and suitable computer programs can provide. The synthesis document included in this part was written to disseminate these theoretical results more widely, by bringing them together and integrating them into a global optimization approach. What remains to be done is to develop a computer program implementing all the optimization principles presented in this thesis, so that they can be applied to the particular situations that breeders encounter.

Marker-assisted backcrossing
Optimization principles

Table of contents

1 Genetic composition of individuals derived from a backcross population
1.1 Allele frequencies at the target gene
1.2 Genetic composition of non-carrier chromosomes
1.2.1 Expected genetic composition of non-carrier chromosomes
1.2.2 Variance of the genetic composition of non-carrier chromosomes
1.3 Genetic composition of a carrier chromosome
1.3.1 Size of the intact segment introgressed around the target gene
1.3.2 Genetic composition of a whole carrier chromosome
1.3.3 Partial conclusion
2 Use of markers for selection in backcrossing
2.1 General optimization approach
2.2 Definition of the marker ideotype
2.2.1 Use of markers to select individuals on their genotype at the target gene
2.2.2 Use of markers to reduce the size of the introgressed segment around the target gene
2.2.3 Use of markers to accelerate the return to the recurrent parent on non-carrier chromosomes
2.2.4 Partial conclusion
2.3 Optimization of selection to obtain the marker ideotype at the lowest cost
2.3.1 Optimal marking strategy
2.3.2 Illustration of the approach in two case studies

Introduction

Backcrossing is a breeding method that aims to introduce one (or several) gene(s) (called target genes) into a particular genetic background. This is why backcrossing is also called gene introgression. The principle of backcrossing (Figure 1) is to cross a parent that is a donor of the target gene with a parent that is a recipient of this gene. The resulting hybrid is crossed again with the recipient parent to produce a segregating population called BC1. Among the individuals of the BC1 population, only those carrying the target gene are selected and crossed again with the recipient parent to produce a new population (BC2). This selection/backcrossing cycle is continued for a few generations (BC3, BC4, etc.) so as to obtain an individual that carries a donor allele only at the target gene. In general, introgression is completed by one generation of selfing to obtain an individual that is homozygous donor at the target gene. Because the recipient parent is used at every generation, it is also called the recurrent parent. Both terms are used in the rest of this document.

Backcrossing is a typical example of genotype building. Even before the first cross between the donor parent and the recipient parent is made, the genotype of the individual to be obtained at the end of the program is known. In this setting, molecular markers are a particularly effective tool. Using molecular markers that are polymorphic between the donor and recipient parents, the parental origin of the corresponding loci can be determined. Moreover, with an appropriate genetic map, the genetic composition of an individual can be estimated from its genotype at marker loci. Finally, using molecular markers in backcross programs makes it possible to select individuals directly on their resemblance to the desired ideal genotype.

The efficiency of a marker-assisted backcross program is measured relative to that of a conventional backcross program. In the first part we therefore present the evolution of populations produced by conventional backcrossing. We then discuss the general principles for using molecular markers in backcross programs, and the expected efficiency of doing so relative to conventional backcrossing. Finally, we give guidelines for establishing optimal strategies for marker-assisted gene introgression.

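Fig. 1 – Principle of backcrossing. [Diagram: the donor parent is crossed with the recipient parent to give the F1 hybrid (= BC0); the F1 is crossed with the recipient parent to give the BC1 population, which is in turn crossed with the recipient parent to give the BC2 population. Legend: donor genetic background, recipient genetic background, target gene.]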

Chapter 1
Genetic composition of individuals derived from a backcross population

Within a population produced by backcrossing, individuals carry, for each pair of chromosomes, one intact chromosome inherited from the recipient parent and one chromosome on which donor and recipient alleles segregate, as shown in Figure 1. The genotype at any locus of an individual from a backcross population can therefore be either heterozygous donor/recipient or homozygous recipient/recipient (denoted DR and RR, respectively, in what follows). Because the donor alleles are all carried by the same chromosome, the linkage phase between alleles at different loci is known exactly. For this reason, the chromosome inherited from the recurrent parent is commonly omitted from graphical representations of individuals derived from backcross programs.

1.1 Allele frequencies at the target gene

At each generation, only individuals that are heterozygous donor/recipient at the target gene are selected. Each of these individuals transmits the donor allele at the target gene to each of its offspring with probability $1/2$. Thus, after backcrossing, half of the offspring are expected to have genotype DR and the other half genotype RR at the target gene. Overall, the frequency of the donor allele at the target gene in a population produced by backcrossing is $1/4$.

1.2 Genetic composition of non-carrier chromosomes

1.2.1 Expected genetic composition of non-carrier chromosomes

Non-carrier chromosomes are the chromosomes that carry no target gene. We compute the frequencies of the possible genotypes on non-carrier chromosomes in the absence of selection. Let $f_1(t)$ be the frequency of genotype DR at a locus of a non-carrier chromosome and $f_2(t)$ the frequency of genotype RR at the same locus. From one backcross generation to the next:
– a locus with the heterozygous genotype DR has probability 1/2 of remaining DR and probability 1/2 of becoming RR;
– a locus with the homozygous recipient genotype RR remains RR for the rest of the backcross program.
Thus, after t generations of backcrossing, the genotype frequencies at a locus of a non-carrier chromosome are

$$
\begin{cases}
f_1(t) = (1/2)^t \\
f_2(t) = 1 - (1/2)^t
\end{cases}
\tag{1.1}
$$

In the absence of selection, these frequencies are the same at all loci of a non-carrier chromosome. The genetic composition of a non-carrier chromosome can therefore be computed from equations 1.1: the total proportion of donor alleles on the non-carrier chromosomes of an individual obtained after t generations of backcrossing is

$$
\frac{1}{2} \times f_1(t) + 0 \times f_2(t) = \left(\frac{1}{2}\right)^{t+1}
\tag{1.2}
$$

For example, after 6 generations of backcrossing, the expected proportion of donor genome on the non-carrier chromosomes of an individual is $(1/2)^7 = 0.0078$. Hence, to obtain an individual with, in expectation, less than 1% of donor genome on its non-carrier chromosomes, at least 6 generations of backcrossing are needed.
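The worked example for equations 1.1 and 1.2 can be checked with a few lines of code. This is a minimal sketch with hypothetical function names; it assumes, as in the text, no selection on non-carrier chromosomes.

```python
def freq_dr(t):
    """Equation 1.1: frequency of genotype DR at a locus after t backcrosses."""
    return 0.5 ** t

def donor_proportion(t):
    """Equation 1.2: expected proportion of donor alleles on non-carrier chromosomes."""
    return 0.5 * freq_dr(t)

def min_generations(max_donor):
    """Smallest t such that the expected donor proportion is below max_donor."""
    if not 0.0 < max_donor <= 1.0:
        raise ValueError("max_donor must be in (0, 1]")
    t = 1
    while donor_proportion(t) >= max_donor:
        t += 1
    return t

if __name__ == "__main__":
    for t in range(1, 8):
        print(f"BC{t}: expected donor proportion = {donor_proportion(t):.4f}")
    print("generations needed for < 1% donor genome:", min_generations(0.01))
```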

1.2.2 Variance of the genetic composition of non-carrier chromosomes

Following the method proposed by Hill (1993), the variance of the genetic composition (expressed as a proportion of donor genome) of non-carrier chromosomes can be computed. For a locus i, define a variable $Z_{i(t)}$ indicating the origin of locus i at generation t, such that $Z_{i(t)} = 0$ if locus i comes from the recipient parent and $Z_{i(t)} = 1$ if it comes from the donor parent. We saw in the previous section that $E(Z_{i(t)}) = 1/2^t$. The proportion of donor genome on a chromosome is denoted $\bar{Z}_{(t)}$. Considering a large number k of equidistant loci on a chromosome, we can write:

$$
\bar{Z}_{(t)} = \frac{1}{k} \sum_{i=1}^{k} Z_{i(t)}
\tag{1.3}
$$

The expectation of $\bar{Z}_{(t)}$ is then:

$$
E(\bar{Z}_{(t)}) = \frac{1}{k} \sum_{i=1}^{k} E(Z_{i(t)}) = \frac{1}{2^t}
\tag{1.4}
$$

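which corresponds to the frequency $f_1(t)$ of equation 1.1. The variance of $\bar{Z}_{(t)}$ can likewise be computed as:

$$
\begin{aligned}
\mathrm{var}(\bar{Z}_{(t)}) &= \mathrm{var}\left(\frac{1}{k} \sum_{i=1}^{k} Z_{i(t)}\right) \\
&= \frac{1}{k^2} \sum_{i=1}^{k} \sum_{j=1}^{k} \left[ E(Z_{i(t)} Z_{j(t)}) - E(Z_{i(t)})\,E(Z_{j(t)}) \right] \\
&= \frac{1}{k^2} \sum_{i=1}^{k} \sum_{j=1}^{k} E(Z_{i(t)} Z_{j(t)}) - (1/4)^t
\end{aligned}
\tag{1.5}
$$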

Note that $E(Z_{i(t)} Z_{j(t)})$ is the probability that two loci i and j of the chromosome both come from the donor parent, so that

$$
E(Z_{i(t)} Z_{j(t)}) = \left(\frac{1 - r_{ij}}{2}\right)^t
\tag{1.6}
$$

where $r_{ij}$ is the recombination rate between loci i and j. If we now consider a continuous distribution of loci i and j, at positions $z_i$ and $z_j$ on a chromosome of length l, the double sum of equation 1.5 becomes a double integral:

$$
\frac{1}{k^2} \sum_{i=1}^{k} \sum_{j=1}^{k} E(Z_{i(t)} Z_{j(t)}) \;\longrightarrow\; \frac{1}{l^2} \int_0^l \!\! \int_0^l \left(\frac{1 - r_{ij}}{2}\right)^t dz_i\, dz_j
\tag{1.7}
$$

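If distances between loci are converted to recombination rates with Haldane's (1919) mapping function, i.e. $r_{ij} = \frac{1}{2}\left(1 - e^{-2|z_i - z_j|}\right)$, equation 1.7 can be integrated and $\mathrm{var}(\bar{Z}_{(t)})$ computed. Since $(1 - r_{ij})/2 = (1 + e^{-2|z_i - z_j|})/4$, a binomial expansion of the integrand yields a term of order 0 equal to $(1/4)^t$, which cancels the $-(1/4)^t$ of equation 1.5. We obtain:

$$
\mathrm{var}(\bar{Z}_{(t)}) = \frac{1}{2l^2} \left(\frac{1}{4}\right)^t \sum_{i=1}^{t} \binom{t}{i} \frac{1}{i^2} \left( 2il - 1 + e^{-2il} \right)
\tag{1.8}
$$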
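As a sanity check on the closed form, the sketch below evaluates equation 1.8 and compares it with a direct numerical integration of equations 1.5 and 1.7 under Haldane's function. Function names are hypothetical, lengths are assumed to be in Morgans, and the double integral is reduced to a single integral over the distance $d = |z_i - z_j|$, whose density on $[0, l]$ is $2(l - d)/l^2$.

```python
import math

def var_donor_proportion(t, l):
    """Equation 1.8: variance of the donor proportion on a non-carrier
    chromosome of length l (Morgans) after t backcross generations."""
    total = sum(
        math.comb(t, i) * (2 * i * l - 1 + math.exp(-2 * i * l)) / i ** 2
        for i in range(1, t + 1)
    )
    return 0.25 ** t * total / (2 * l * l)

def var_numeric(t, l, n=2000):
    """Midpoint-rule evaluation of equations 1.5 and 1.7 (Haldane's function)."""
    h = l / n
    mean_product = 0.0
    for j in range(n):
        d = (j + 0.5) * h                       # distance between the two loci
        weight = 2.0 * (l - d) / l ** 2          # density of d on [0, l]
        mean_product += weight * ((1.0 + math.exp(-2.0 * d)) / 4.0) ** t * h
    return mean_product - 0.25 ** t              # subtract E(Z_i) E(Z_j)

if __name__ == "__main__":
    for t in (1, 2, 3, 6):
        print(t, var_donor_proportion(t, 1.0), var_numeric(t, 1.0))
```

For a very short chromosome and t = 1 all loci are inherited together, so the variance should approach that of a single Bernoulli(1/2) variable, i.e. 1/4, which the formula reproduces.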

1.3 Genetic composition of a carrier chromosome

1.3.1 Size of the intact segment introgressed around the target gene

A carrier chromosome is a chromosome on which a target gene is located. Selecting to keep this gene heterozygous at each generation drags along, around the gene, a whole intact chromosome segment that has never undergone a crossover since generation BC0 (Figure 1.1). Hanson (1959) computed the mean size of this intact segment for a target gene located at the centre of a carrier chromosome. His derivation consists in finding the distribution of the crossover closest to the target gene over t generations of backcrossing.

Fig. 1.1 – Fragment dragged around an introgressed gene (I). [Diagram labels: target gene C, distance x, interval dx, lengths Lg and Ld.]

Considering a target gene C of genotype DR after t generations, we want the probability that the crossover closest to C among those that occurred during the t backcross generations lies at distance x from C. For this event to occur:
– no crossover must have occurred in the interval of length x. If crossover positions follow a Poisson process (with x in Morgans), this probability is $(e^{-x})^t = e^{-tx}$;
– at least one crossover must have occurred in an infinitesimal interval ]x, x + dx]. The probability that a crossover occurs in this interval in one generation is dx, and the probability of more than one crossover in dx over the t generations is zero. Over the t generations there were t opportunities for a crossover at x, so the probability of a crossover in dx over t generations of backcrossing is t dx.
Overall, the probability that the segment dragged on one side of the target gene has length at most x is

$$
P(X \le x) = \int_0^x t e^{-tu}\, du
\tag{1.9}
$$

That is, the density of the distance to the crossover closest to C is

$$
f_t(x) = t e^{-tx}
\tag{1.10}
$$

The probability that no crossover occurred between the target gene and the "left" telomere after t generations of backcrossing is $e^{-tL_g}$, where $L_g$ is the length of the chromosome segment between the target gene and the "left" telomere. In that case, the length of the introgressed segment is $L_g$. Finally, considering both the case where a crossover reduced the size of the introgressed segment and the case where no crossover occurred on the chromosome segment, the mean size of the introgressed segment on one side of the gene is:

$$
\mu_{SI}(t, L_g) = \int_0^{L_g} x f_t(x)\, dx + L_g e^{-tL_g} = \frac{1}{t}\left(1 - e^{-tL_g}\right)
\tag{1.11}
$$

If crossover events are assumed to be independent, the expected sizes of the introgressed segments on each side of the gene are computed independently with equation 1.11. For a target gene located at distance $L_d$ from the first telomere of the chromosome and $L_g$ from the second, the mean size of the intact segment dragged around the gene is therefore:

$$
\mu_{SI}(t, L_g) + \mu_{SI}(t, L_d) = \frac{1}{t}\left(2 - e^{-tL_g} - e^{-tL_d}\right)
\tag{1.12}
$$
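Equations 1.11 and 1.12 can be compared with a direct simulation of the argument above. The sketch below is illustrative only: function names are hypothetical, distances are in Morgans, and each meiosis is assumed to place crossovers as a Poisson process with one crossover per Morgan on average, so that the distance to the nearest crossover on one side is exponential with rate 1.

```python
import math
import random

def mean_segment_one_side(t, length):
    """Equation 1.11: expected intact length (Morgans) on one side of the gene."""
    return (1.0 - math.exp(-t * length)) / t

def mean_segment(t, lg, ld):
    """Equation 1.12: both sides of the gene, assuming independent crossovers."""
    return mean_segment_one_side(t, lg) + mean_segment_one_side(t, ld)

def simulate_one_side(t, length, rng):
    """Distance to the nearest crossover over t meioses, capped at the telomere."""
    nearest = min(rng.expovariate(1.0) for _ in range(t))
    return min(nearest, length)

def simulated_mean(t, lg, ld, n=50_000, seed=1):
    """Monte Carlo estimate of the mean intact segment length around the gene."""
    rng = random.Random(seed)
    total = sum(
        simulate_one_side(t, lg, rng) + simulate_one_side(t, ld, rng)
        for _ in range(n)
    )
    return total / n

if __name__ == "__main__":
    # Target gene at the centre of a 100 cM chromosome (an assumed example).
    for t in (1, 2, 4, 6, 10):
        print(f"BC{t}: formula {100 * mean_segment(t, 0.5, 0.5):.1f} cM, "
              f"simulation {100 * simulated_mean(t, 0.5, 0.5):.1f} cM")
```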

Similarly, the variance of the size of the introgressed segment on one side of the target gene can be computed, as shown by Naveira and Barbadilla (1992). Including, as in equation 1.11, the case where no crossover occurred on the segment of length l:

$$
\begin{aligned}
\sigma^2_{SI}(t, l) &= \int_0^l x^2 f_t(x)\, dx + l^2 e^{-tl} - \mu^2_{SI}(t, l) \\
&= \frac{1}{t^2}\left(1 - e^{-tl}\left(2tl + e^{-tl}\right)\right)
\end{aligned}
\tag{1.13}
$$

Here again, assuming independent crossover events, the variance of the size of the introgressed segment on both sides of the target gene is $\sigma^2_{SI}(t, L_g) + \sigma^2_{SI}(t, L_d)$, where $L_d$ and $L_g$ are defined as above.
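The closed form of equation 1.13 can be checked against the integral it comes from. In the sketch below (hypothetical function names, lengths in Morgans), the second moment is integrated numerically with the density $t e^{-tx}$ and the probability mass $e^{-tl}$ at the telomere is added explicitly, as in equation 1.11.

```python
import math

def mean_one_side(t, l):
    """Equation 1.11: mean intact length on one side of the target gene."""
    return (1.0 - math.exp(-t * l)) / t

def var_one_side(t, l):
    """Equation 1.13 (closed form): variance of the intact length on one side."""
    e = math.exp(-t * l)
    return (1.0 - e * (2.0 * t * l + e)) / t ** 2

def var_one_side_numeric(t, l, n=20_000):
    """Midpoint-rule evaluation of the integral form of equation 1.13."""
    h = l / n
    second_moment = 0.0
    for j in range(n):
        x = (j + 0.5) * h
        second_moment += x * x * t * math.exp(-t * x) * h
    second_moment += l * l * math.exp(-t * l)   # no crossover: segment reaches the telomere
    return second_moment - mean_one_side(t, l) ** 2

def var_both_sides(t, lg, ld):
    """Variance of the total intact length, assuming independent sides."""
    return var_one_side(t, lg) + var_one_side(t, ld)

if __name__ == "__main__":
    for t in (1, 2, 4, 6):
        sd_cm = 100 * math.sqrt(var_both_sides(t, 0.5, 0.5))
        print(f"BC{t}: standard deviation of the intact segment = {sd_cm:.1f} cM")
```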

1.3.2 Genetic composition of a whole carrier chromosome

In addition to the intact segment retained around the target gene, other donor alleles continue to segregate on the carrier chromosome. Stam and Zeven (1981) quantified the share of genetic background that remains of donor genotype over a whole carrier chromosome. The approach is to compute the proportion of donor genetic background on a chromosome conditional on genotype DR at the target gene. At any point X of the chromosome, the probability that X is of genotype DR given that the target gene (denoted C above and I in what follows) is of genotype DR follows from noting that the only way to observe both X and I with genotype DR at generation t is that no recombination occurred between X and I during the t generations. Note that we now work with recombination probabilities rather than crossover probabilities. The relation between recombination and crossover is explained in Box 1.

Fig. 1.2 – Cumulative size of the donor genome. Mean cumulative length of genome segments from the donor parent that still segregate in a population, as a function of the number of backcross generations performed (BCt) and of the part of the genome considered: on the non-carrier chromosomes (dotted line), on the whole carrier chromosome (solid line), or corresponding to the intact segment introgressed around the target gene (dashed line). The genome considered consists of 10 non-carrier chromosomes and 1 carrier chromosome. All chromosomes are 100 cM long.

We obtain:

$$
\begin{aligned}
P(\{X = DR\} \mid \{I = DR\}) &= \frac{P(\{X = DR\} \cap \{I = DR\})}{P(\{I = DR\})} \\
&= \frac{(1/2)^t (1 - r_{XI})^t}{(1/2)^t} \\
&= (1 - r_{XI})^t
\end{aligned}
\tag{1.14}
$$

where $r_{XI}$ is the recombination rate between X and I. Denoting by x the distance between X and I, and using the mapping function of Haldane (1919), which relates genetic distances to recombination rates, we obtain:

$$g_t(x) = (1 - r_{XI})^t = \left(\frac{1 + e^{-2x}}{2}\right)^t \tag{1.15}$$

To obtain the expected proportion of donor genome along the chromosome, $g_t(x)$ must be integrated over the whole chromosome. For a chromosome of length $L = L_g + L_d$, the proportion of donor genome on this chromosome is:

$$\mu_{tot}(t, L_g, L_d) = \frac{1}{L}\left(\int_0^{L_g} g_t(x)\,dx + \int_0^{L_d} g_t(x)\,dx\right) = \left(\frac{1}{2}\right)^t \left\{1 + \frac{1}{L}\sum_{k=1}^{t}\binom{t}{k}\frac{1}{2k}\left(1 - e^{-2kL_g} + 1 - e^{-2kL_d}\right)\right\} \tag{1.18}$$

1.3.3 Partial conclusion

Figure 1.2 shows how the genetic composition, expressed as length of donor genome, changes over generations for a genome made of 10 non-carrier chromosomes and one carrier chromosome, each 100 centiMorgans long. The amount of donor genome on the non-carrier chromosomes decreases much faster than on the carrier chromosome. The figure also shows that the donor genome remaining on the carrier chromosome lies mostly in the intact segment introgressed around the target gene. The size of this segment decreases very slowly, so large segments dragged along with the gene can still be observed after many backcross generations. This was confirmed experimentally by Young and Tanksley (1989), who analysed a posteriori individuals derived from backcross programs in tomato, as shown in figure 1.5. The carrier chromosomes shown are those of individuals derived from backcross programs between a wild tomato species (Lycopersicon peruvianum, donor parent) and the cultivated tomato (Lycopersicon esculentum, recipient parent). In all cases the introgressed gene is a gene for resistance to tobacco mosaic virus (Tm-2). The figure makes clear that, even after many backcross generations, the introduced segments can remain large. For example, the individual Craigella-Tm-2, obtained after 11 backcross generations, carries about 2/3 of donor genome on its carrier chromosome.
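The contrast drawn above between carrier and non-carrier chromosomes can be sketched numerically. The following Python snippet is only an illustration under stated assumptions: distances are in Morgans, the function names are hypothetical, the closed form is the one of equation 1.18, and the non-carrier proportion is taken as $(1/2)^t$, the unconditional probability of genotype DR at any locus that appears in equation 1.14. The midpoint-rule integration is there only as a cross-check of the closed form.

```python
import math


def g_t(x, t):
    """Eq. 1.15: probability that a locus x Morgans from the target is still DR."""
    return ((1.0 + math.exp(-2.0 * x)) / 2.0) ** t


def mu_tot_closed(t, L_g, L_d):
    """Eq. 1.18: expected donor proportion on the carrier chromosome."""
    L = L_g + L_d  # chromosome length, in Morgans
    s = sum(
        math.comb(t, k) / (2.0 * k)
        * (2.0 - math.exp(-2.0 * k * L_g) - math.exp(-2.0 * k * L_d))
        for k in range(1, t + 1)
    )
    return 0.5 ** t * (1.0 + s / L)


def mu_tot_numeric(t, L_g, L_d, n=20000):
    """Same quantity by midpoint-rule integration of g_t over both arms."""
    def integral(a):
        h = a / n
        return h * sum(g_t((i + 0.5) * h, t) for i in range(n))
    return (integral(L_g) + integral(L_d)) / (L_g + L_d)


def non_carrier_proportion(t):
    """Assumed expected donor proportion on a non-carrier chromosome."""
    return 0.5 ** t


if __name__ == "__main__":
    # Target gene in the middle of a 100 cM chromosome (illustrative choice).
    for t in (1, 3, 6, 11):
        print(t, round(mu_tot_closed(t, 0.5, 0.5), 4), non_carrier_proportion(t))
```

For any number of generations, the carrier-chromosome proportion stays above $(1/2)^t$, which is the qualitative behaviour described for figure 1.2.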

Fig. 1.5 – Examples of intact segment sizes (shown in black) in different backcross programs in tomato. Only the carrier chromosomes of the individuals are shown. The number of backcross generations performed is given on the right (a "?" means that this number is unknown).

Box 1: Crossing-over and recombination

A crossing-over is a chromosomal event that occurs during meiosis. When the pairs of homologous chromosomes associate into tetrads, attachment points form between non-sister chromatids. These points are called chiasmata. At the chiasmata, the chromatids may break and exchange genetic material, as shown in figure 1.3. A crossing-over is then said to have occurred at the chiasma.

Fig. 1.3 – Crossing-over event during meiosis

Recombination between loci is a genetic consequence of crossing-overs that can be observed after a meiosis. Consider a diploid individual of genotype a1/a2 at a locus A and b1/b2 at a locus B, as shown in figure 1.4. Within this individual, alleles a1 and b1 are on the same chromosome (they are said to be in coupling phase), as are alleles a2 and b2. After meiosis, this individual can produce gametes carrying:
- the allele pairs (a1, b1) and (a2, b2). In these gametes the phase is identical to that of the diploid individual. They are called parental gametes.
- the allele pairs (a1, b2) and (a2, b1). In these gametes the phase differs from that of the diploid individual. They result from a recombination between loci A and B at meiosis and are called recombinant gametes.

Fig. 1.4 – Parental and recombinant gametes after meiosis in a diploid individual

The proportion of recombinant gametes among all gametes produced by an individual is the recombination rate between loci A and B ($r_{AB}$). Recombination occurs between two loci when an odd number of crossing-overs takes place in the chromosome segment between them. The recombination rate between two loci is therefore the probability that an odd number of crossing-overs occurred between these two loci.

Assuming that crossing-over events are independent, the random variable X giving the number of crossing-overs between loci A and B follows a Poisson distribution whose mean is the genetic distance between locus A and locus B ($d_{AB}$). This distance is called the Haldane distance (Haldane 1919). The probability that k crossing-overs occurred in the chromosome segment between A and B is:

$$P(X = k) = e^{-d_{AB}}\,\frac{(d_{AB})^k}{k!} \tag{1.16}$$

The probability that an odd number of crossing-overs occurred (i.e. that there was recombination) between loci A and B is:

$$r_{AB} = \sum_{k=0}^{\infty} P(X = 2k + 1) \quad \text{that is} \quad r_{AB} = \frac{1}{2}\left(1 - e^{-2d_{AB}}\right) \tag{1.17}$$
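The step from the Poisson sum to the closed form 1.17 can be checked numerically. The short Python sketch below is an illustration only: function names are hypothetical, distances are in Morgans, and the infinite sum is truncated at a fixed number of terms, which is more than enough for the distances met on a chromosome.

```python
import math


def poisson_pmf(k, d):
    """Eq. 1.16: probability of k crossing-overs on a segment of d Morgans."""
    return math.exp(-d) * d ** k / math.factorial(k)


def recombination_rate_from_sum(d, terms=60):
    """Probability of an odd number of crossing-overs (truncated sum)."""
    return sum(poisson_pmf(2 * k + 1, d) for k in range(terms))


def haldane(d):
    """Eq. 1.17: closed-form recombination rate for a distance of d Morgans."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))


if __name__ == "__main__":
    for d in (0.01, 0.1, 0.5, 2.0):
        print(d, recombination_rate_from_sum(d), haldane(d))
```

The two columns printed agree to floating-point precision, and the rate tends to 1/2 for distant loci, as expected for unlinked loci.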

Chapter 2 Use of markers for selection in backcross programs

As seen in the previous section, the objective of selection in a backcross program comes down to obtaining an individual with an ideal genotype (ideotype) that is completely determined by the initial parents. This ideotype carries only recipient alleles in its genetic background and a target gene in the heterozygous state, if we set aside the final generation of selfing. As we shall see, markers can make selection within backcross populations both easier and more efficient.

In the classical backcross described above, the phenotypes of the individuals are used to identify those that are heterozygous at the target gene, which are then crossed again to the recurrent parent. This phenotypic selection can be problematic, in particular for recessive target genes. In that case, only progeny testing can identify the individuals that are heterozygous at the target gene, which makes the introgression of a recessive gene almost impracticable under these conditions. Selection is also difficult when the target gene is involved in a quantitative trait (QTL), especially for QTL with small effects. Markers can be used to identify the individuals that are heterozygous at the target gene in the population, whatever the type of target gene. Using markers of the gene allows it to be handled like a dominant major gene, which overcomes the problems mentioned above (see paragraph 2.2.1 below).

Moreover, in the classical backcross it is not possible to quantify the proportion of alleles that remain of donor genotype in the genetic background of the individuals. Individuals therefore cannot be selected on the basis of their genetic background, and a large number of generations is needed to obtain the ideotype, as shown by the number of backcrosses performed to obtain the individuals of figure 1.5. Markers evenly distributed over the genome provide an estimate of the amount of donor genome present in the genetic background of an individual. In particular, well-placed markers make it possible to identify individuals showing (i) recombinations close to the target gene, and hence small introgressed segments, and/or (ii) a low proportion of donor genes on the non-carrier chromosomes (see paragraphs 2.2.2 and 2.2.3 below).

Many theoretical studies have sought to optimize the use of markers to facilitate selection in backcross populations. This part presents the most significant results of these studies.

2.1 General optimization approach

A general approach to optimizing a marker-assisted backcross program can be defined by breaking it down into three steps:

Definition of the introgression objectives. The first step of any marker-assisted backcross is to define the objectives of the introgression. Typically:
- what is the minimum rate of return to the recurrent parent that the final individual must show;
- what is the maximum acceptable size of the introgressed segment around the gene.

Definition of the marker ideotype. Once these criteria are set, a marker ideotype can be drawn up. Starting from the introgression objectives, one can choose the number and position of the markers to select on so as to be as close as possible to the ideotype over the whole genome. The choice of markers typically relies on theoretical calculations that estimate the true genetic composition of an individual from its marker genotype. The main calculations behind these estimates are presented in section 2.2.

Search for an optimal solution to obtain the marker ideotype. Once the marker ideotype is defined, the strategy to obtain this genotype at the lowest cost and/or as quickly as possible must be determined. The best introgression strategy is typically sought by simulation, because the space of parameters influencing the cost and/or duration of a marker-assisted backcross program is too large to be explored exhaustively. Section 2.3 examines how to determine the best introgression strategy in cases representative of real situations.

2.2 Definition of the marker ideotype

2.2.1 Use of markers to select individuals on their genotype at the target gene

Using markers to control the target gene is one of the earliest uses of markers for selection in backcross programs. The first theoretical studies on this subject sought to estimate the probability that the gene is controlled by markers (Melchinger 1990). Considering m markers located in the vicinity of the target gene, the principle is to determine the probability that an individual carries the gene in the heterozygous state given that it is heterozygous at all m markers. We denote this probability $P_{gc}$. The rest of this section shows how to compute $P_{gc}$ as a function of the number of markers and of their positions relative to the target gene. In general, $P_{gc}$ can be written as:

$$P_{gc} = \frac{P(\{M = DR\} \cap \{G = DR\})}{P(\{M = DR\})} \tag{2.1}$$

where M stands for the set of markers, {M = DR} is the event "the marker(s) is (are) of heterozygous donor / recipient genotype" and {G = DR} is the event "the target gene is of heterozygous donor / recipient genotype".

Fig. 2.1 – Schematic representation of an individual from a backcross population. Figure labels: heterozygous marker, recipient-homozygous marker, target gene, intact segment, non-carrier chromosomes, carrier chromosome.

2.2.1.1 Case of a gene of known position

Case of an intragenic marker. If an intragenic marker is available, the genotype of the individuals at the target gene can be read directly. In this case we obviously have

$$P_{gc} = 1 \tag{2.2}$$


Case of a single marker close to the gene. An intragenic marker is not always available. It is nevertheless sometimes possible to know the recombination rate between the gene and a nearby marker. The probability $P_{gc}$ can then be computed as a function of this rate:

$$P_{gc} = 1 - r_{CM} \tag{2.3}$$

where $r_{CM}$ is the recombination rate between the target gene and the marker. Note that this probability equals 1 when $r_{CM} = 0$, which corresponds to an intragenic marker, and 1/2 when $r_{CM} = 1/2$, that is, when the marker and the gene are not linked. In the latter case, the marker genotype carries no information on the genotype at the target gene, and the probability of keeping the gene in the heterozygous donor / recipient state is 1/2, as for any other locus of the genome. The change in this probability over backcross generations is shown in figure 2.2.

Case of two markers on either side of the gene. If two markers located on either side of the gene on the map are available, the gene can be controlled by keeping both markers in the heterozygous donor / recipient state throughout selection. In formula 2.1, {M = DR} is then the event "both markers are of heterozygous donor / recipient genotype". The probability $P_{gc}$ is:

$$P_{gc} = \frac{(1 - r_{M_g C})(1 - r_{M_d C})}{1 - r_{M_g M_d}} \tag{2.4}$$

where $r_{M_g C}$, $r_{M_d C}$ and $r_{M_g M_d}$ are the recombination rates between the left marker and the target gene, between the right marker and the target gene, and between the two markers, respectively. Assuming no interference between crossing-over events, we obtain:

$$P_{gc} = \frac{\frac{1}{2}\left(1 + e^{-2d_g}\right)\left(1 + e^{-2d_d}\right)}{1 + e^{-2(d_g + d_d)}} \tag{2.5}$$

where $d_g$ is the genetic distance between the "left" marker and the gene and $d_d$ the distance between the "right" marker and the gene, both computed with the function of Haldane (1919) (see Box 1). Figure 2.2 shows how the probability of keeping the gene changes when individuals are selected on one marker (filled circles) or two markers (triangles). For the same distance to the gene (5 centiMorgans here), the probability of controlling the target gene is much higher with two markers than with a single marker. Furthermore, the probabilities of keeping the gene with two more distant markers (10 and 20 centiMorgans in the figure) are higher than with a single marker close to the gene. It is therefore preferable to use two markers "far" from the target gene rather than a single "close" marker.
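The comparison between one close marker and two distant flanking markers can be reproduced with a few lines of Python. This is a minimal sketch: function names are hypothetical and distances are in Morgans. Equations 2.3 and 2.5 are taken as per-generation probabilities; raising them to the power t to follow several backcross generations is an assumption made here (marker selection repeated identically at every generation), not a formula given in the text.

```python
import math


def one_minus_r(d):
    """1 - r under Haldane's function, for a distance of d Morgans."""
    return 0.5 * (1.0 + math.exp(-2.0 * d))


def pgc_one_marker(d):
    """Eq. 2.3 with r expressed from the gene-marker distance d."""
    return one_minus_r(d)


def pgc_two_markers(d_g, d_d):
    """Eq. 2.4 / 2.5: gene flanked by markers at distances d_g and d_d."""
    return one_minus_r(d_g) * one_minus_r(d_d) / one_minus_r(d_g + d_d)


def pgc_after_generations(p_single_generation, t):
    """Assumption: independent, identical selection steps over t generations."""
    return p_single_generation ** t


if __name__ == "__main__":
    print("1 marker at 5 cM :", pgc_one_marker(0.05))
    for d in (0.05, 0.10, 0.20):
        print("2 markers at", d, "M:", pgc_two_markers(d, d))
    # Hypothetical 6-generation program
    print(pgc_after_generations(pgc_one_marker(0.05), 6),
          pgc_after_generations(pgc_two_markers(0.20, 0.20), 6))
```

With these formulas, two markers 20 cM away on each side still control the gene better than a single marker at 5 cM, in line with the conclusion drawn from figure 2.2.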

Fig. 2.2 – Probability of control of a mapped target gene with 1 marker (filled circles) or 2 markers (open triangles) over 20 backcross generations. Control of the gene by one marker is shown for a distance (d) of 5 cM. Control of the target gene by two markers equidistant from it is shown for distances of 5 cM, 10 cM and 20 cM.

Fig. 2.3 – Distribution (C) of the true position of a QTL around the estimated position (λ). The confidence interval around the position is shown as a red box.

Conclusion. It is obviously ideal to have an intragenic marker to keep the gene. This case is nevertheless relatively rare and mostly corresponds to the situation where the target gene is a transgene to be introduced into a new genetic background.

2.2.1.2 Special case of the control of a QTL

A QTL is a special locus to control with markers because the true position of the target gene is not known. In general, an estimated position (denoted λ) and a confidence interval for this position are available. This confidence interval can be interpreted as the region in which the true position of the gene lies with probability 1 − α. The true position of the QTL can then be assumed to follow a normal distribution (blue curve in figure 2.3) centred on position λ, with a standard deviation (σ) determined by the length of the confidence interval. Indeed, as suggested by Hospital and Charcosset (1997), if $x_{inf}$ and $x_{sup}$ denote the bounds of the confidence interval, σ can be computed a posteriori as the solution of the equation

$$\int_{x_{inf}}^{x_{sup}} \phi[x, \lambda, \sigma]\,dx = 1 - \alpha$$

where $\phi[x, \lambda, \sigma]$ is the density function of the normal distribution with mean λ and standard deviation σ. The probability of keeping a target QTL with one or several markers can be computed by applying the previous calculations ($P_{gc}$, equations (2.1) and (2.5)) to each possible position of the QTL and integrating along the chromosome:

$$P_{QTL} = \int_0^{L} P_{gc}[\lambda, x]\;\phi[x, \lambda, \sigma]\,dx \tag{2.6}$$

[Figure 2.4: four panels, for confidence intervals (CI) of 10 cM, 20 cM, 40 cM and 60 cM. Each panel shows several marker configurations with, for each, the control probability P(Q|M) and the number of generations N.]

Fig. 2.4 – Optimal marker positions to control a QTL and associated control probabilities, as a function of the precision of the estimated QTL position (given as the size of the confidence interval). Black triangles mark the estimated position of the QTL. The chromosome sections drawn as boxes correspond to the confidence interval on the estimated QTL position. The numbers on the left (P(Q|M)) are the probability of control of the QTL (called $P_{QTL}$ in the text); the numbers on the right (N) are the number of backcross generations performed. After Hospital and Charcosset (1997).

Here $P_{QTL}$ is the probability of control of the QTL, the equivalent of $P_{gc}$ for a target gene of unknown position (QTL). Visscher et al. (1996) used equation 2.6 to compute the probabilities of control of a QTL with 1 or 2 markers. Hospital and Charcosset (1997) extended these integrations to compute control probabilities for any number of markers. Given the positions of the markers one wishes to use, these equations give the probability of keeping the donor allele at the introgressed QTL; they can also be used to determine the number and positions of markers that control the QTL optimally. Figure 2.4 shows examples of optimal marker positions to control a QTL as a function of the size of the confidence interval. In general, for the confidence interval sizes usually encountered, 2 or 3 markers within the interval are sufficient.
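As an illustration of equation 2.6 for two markers, the integral can be approximated numerically. The Python sketch below relies on several assumptions that are not taken from the cited papers: positions are in Morgans, the confidence interval is symmetric around λ (so that σ follows from the normal quantile), a QTL lying outside the marker bracket is controlled only by the nearest marker (which follows from the no-interference hypothesis), and the normal density is not renormalized on the chromosome. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def one_minus_r(d):
    """1 - r under Haldane's function, for a distance of d Morgans."""
    return 0.5 * (1.0 + math.exp(-2.0 * d))


def pgc_at(x, m_left, m_right):
    """P(gene at x is DR | both markers DR), without interference."""
    if x <= m_left:   # QTL outside the bracket, left side
        return one_minus_r(m_left - x)
    if x >= m_right:  # QTL outside the bracket, right side
        return one_minus_r(x - m_right)
    return (one_minus_r(x - m_left) * one_minus_r(m_right - x)
            / one_minus_r(m_right - m_left))


def sigma_from_ci(x_inf, x_sup, alpha=0.05):
    """Standard deviation such that a symmetric CI has coverage 1 - alpha."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return (x_sup - x_inf) / (2.0 * z)


def p_qtl(lam, sigma, m_left, m_right, L, n=20000):
    """Eq. 2.6 by midpoint-rule integration over a chromosome of length L."""
    dist = NormalDist(lam, sigma)
    h = L / n
    return h * sum(
        pgc_at((i + 0.5) * h, m_left, m_right) * dist.pdf((i + 0.5) * h)
        for i in range(n)
    )


if __name__ == "__main__":
    # QTL estimated at 50 cM on a 100 cM chromosome, markers at 40 and 60 cM.
    for half_width in (0.05, 0.20):
        s = sigma_from_ci(0.5 - half_width, 0.5 + half_width)
        print(2 * half_width, p_qtl(0.5, s, 0.4, 0.6, 1.0))
```

For a fixed pair of markers, the control probability drops as the confidence interval widens, which is why wider intervals call for more markers in figure 2.4.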

2.2.2 Use of markers to reduce the size of the introgressed segment around the target gene

Selecting heterozygotes at the target gene leads to the introgression of a segment of donor genome around the gene (a phenomenon also known as hitch-hiking). We saw in paragraph 1.3.1 that the mean length of this segment can be large. Markers can be used to select, among the individuals heterozygous at the target gene, those in which crossing-overs have cut this segment. Indeed, with markers on each side of the gene, recombinations between these markers and the target gene, caused by crossing-overs close to the gene, can be observed (figure 2.5); individuals carrying recombinant genotypes are then selected. The questions that arise are (i) where to place these markers and (ii) how many generations are needed to obtain recombinations on both sides of the gene. This part shows how to answer these questions.

If the target gene is controlled by two flanking markers ($M_g$ and $M_d$), the intact segment of donor genome retained around the target gene is made of the chromosome segment between $M_g$ and $M_d$, which contains the target gene, and of the segments dragged along outside markers $M_g$ and $M_d$, as shown in figure 2.5. Here we compute the size of the parts of the segment that lie outside $M_g$ and $M_d$. We denote by $r_1$ the recombination rate between M1 and $M_g$ and by $r_2$ the recombination rate between $M_d$ and M2.

[Figure 2.5: marker order along the chromosome is M1, $M_g$, C, $M_d$, M2, with recombination rate $r_1$ between M1 and $M_g$ and $r_2$ between $M_d$ and M2; the intact segment of donor genome spans the region around C.]

Fig. 2.5 – Use of markers to reduce the size of the introgressed segment. One marker is placed on each side of the target gene (C). These markers are named M1 and M2 in the figure. The target gene can be controlled by markers (named $M_g$ and $M_d$ in the figure), which are of genotype DR. The ideal marker genotype is shown here: the two markers M1 and M2 are of genotype RR, and the target gene and the markers controlling it are of genotype DR. The size of the intact segment retained around the gene is then necessarily smaller than the size expected without selection (see details in the text).

2.2.2.1 Relation between gene-marker distance and size of the introgressed segment

Hospital (2001) derived the equations giving the expected size of the introgressed segment when markers close to the gene are used to reduce it. His approach is similar to that of Naveira and Barbadilla (1992), described in paragraph 1.3.1, but takes into account the information provided by the markers. We place ourselves in the configuration of figure 2.5 and show how to compute the mean size of the introgressed segment between marker M1 and marker $M_g$ when M1 is of genotype RR. Note that the calculation of the mean size of the introgressed segment between M2 and $M_d$ is equivalent. We want the distribution of the size of the segment at generation t given that marker M1 is of genotype RR at generation t.


The probability that marker M1 is of genotype RR at generation t is:

$$1 - (1 - r_1)^t \tag{2.7}$$

Now consider a locus X on the chromosome segment between M1 and $M_g$, located x Morgans from $M_g$. We want the probability that the portion of intact segment lying between M1 and $M_g$ has length x. For this to be the case:

1. At a given generation, say $t_{CO}$, a crossing-over must have occurred at x, and no crossing-over must have cut the segment between $M_g$ and X during the t generations. We saw how to compute the probability of this event in paragraph 1.3.1; it equals

$$e^{-tx}\,dx \tag{2.8}$$

2. Jointly, marker M1 must be of genotype RR. Consider first the complementary event: M1 is of genotype DR. Let $r_{XM_1}$ denote the recombination rate between X and M1.
- Up to generation $t_{CO}$, X is also of genotype DR; the probability that M1 and X are both of genotype DR is therefore $(1 - r_{XM_1})^{t_{CO}}$.
- At generation $t_{CO}$, X is of genotype RR. For M1 to be of genotype DR, a recombination between X and M1 is required; the probability of this event is $r_{XM_1}$.
- From generation $t_{CO} + 1$ up to t, X is of genotype RR and M1 of genotype DR. The probability of this event is $(1 - r_{XM_1})^{t - (t_{CO} + 1)}$.

Overall, the probability that marker M1 is of genotype DR and that the crossing-over closest to $M_g$ occurred at X is:

$$(1 - r_{XM_1})^{t_{CO}} \times r_{XM_1} \times (1 - r_{XM_1})^{t - (t_{CO} + 1)} = r_{XM_1}(1 - r_{XM_1})^{t-1} \tag{2.9}$$

The probability we are looking for is that of the complementary event, that is:

$$1 - r_{XM_1}(1 - r_{XM_1})^{t-1} \tag{2.10}$$

Finally, given that M1 is of genotype RR, the probability that the length of the portion of intact segment between M1 and $M_g$ is at most x is obtained by summing over the possible generations $t_{CO}$ and integrating:

$$P_t(X \le x) = \int_0^x \sum_{t_{CO}=1}^{t} \frac{1 - r_{XM_1}(1 - r_{XM_1})^{t-1}}{1 - (1 - r_1)^t}\; e^{-tx}\,dx \tag{2.11}$$

Assuming no interference between crossing-overs, the recombination rate $r_1$ between $M_g$ and M1 can be computed from their genetic distance ($d_1$) with Haldane's formula, and likewise $r_{XM_1}$ from the distance $d_1 - x$. We then obtain:

$$P_t(X \le x) = \int_0^x \frac{2^t - \left(1 - e^{-2(d_1 - x)}\right)\left(1 + e^{-2(d_1 - x)}\right)^{t-1}}{2^t - \left(1 + e^{-2d_1}\right)^t}\; t\,e^{-tx}\,dx \tag{2.12}$$

That is, the density function of the position of the crossing-over closest to $M_g$ is:

$$f_M(t, x) = \frac{2^t - \left(1 - e^{-2(d_1 - x)}\right)\left(1 + e^{-2(d_1 - x)}\right)^{t-1}}{2^t - \left(1 + e^{-2d_1}\right)^t}\; t\,e^{-tx} \tag{2.13}$$

The mean and variance of the length of the portion of introgressed segment between M1 and $M_g$, given that marker M1 is of genotype RR, can then be computed:

$$\mu_{SI|M}(t, d_1) = \int_0^{d_1} x\, f_M(t, x)\,dx \tag{2.14}$$

$$\sigma^2_{SI|M}(t, d_1) = \int_0^{d_1} x^2 f_M(t, x)\,dx - \left(\mu_{SI|M}(t, d_1)\right)^2 \tag{2.15}$$
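Equations 2.13 to 2.15 lend themselves to direct numerical integration. The following Python sketch is one possible way to do it, not the implementation of Hospital (2001): function names are hypothetical, distances are in Morgans, and a plain midpoint rule is used. The density is written with recombination rates rather than with the $2^t$ form, which is algebraically the same expression.

```python
import math


def haldane_r(d):
    """Recombination rate for a distance of d Morgans (Haldane)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))


def f_m(t, x, d1):
    """Eq. 2.13: density of the closest crossing-over, given M1 is RR."""
    r_x = haldane_r(d1 - x)  # between X and M1
    r_1 = haldane_r(d1)      # between Mg and M1
    num = 1.0 - r_x * (1.0 - r_x) ** (t - 1)
    den = 1.0 - (1.0 - r_1) ** t
    return num / den * t * math.exp(-t * x)


def segment_moments(t, d1, n=20000):
    """Return (total mass, mean, variance) of f_M on [0, d1] (eq. 2.14, 2.15)."""
    h = d1 / n
    mass = mean = second = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        w = f_m(t, x, d1) * h
        mass += w
        mean += x * w
        second += x * x * w
    return mass, mean, second - mean ** 2


if __name__ == "__main__":
    for d1 in (0.01, 0.05, 0.20):
        for t in (1, 3, 6):
            mass, mean, var = segment_moments(t, d1)
            print(d1, t, round(mass, 6), round(mean, 5), round(var, 7))
```

The first value returned is a sanity check: the density should integrate to 1 over $[0, d_1]$. For small gene-marker distances the mean stays close to $d_1/2$ whatever the generation.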

An explicit form of $\mu_{SI|M}(t, d_1)$, derived by integrating equation 2.14, can be found in Hospital (2001). In practice, many numerical computing packages can be used to solve equations 2.14 and 2.15. Considering markers $M_g$ and $M_d$, the mean size of the segment dragged around the target gene is $\mu_{SI|M}(t, d_1) + \mu_{SI|M}(t, d_2) + d_{M_g M_d}$, where $d_2$ is the genetic distance between $M_d$ and M2 and $d_{M_g M_d}$ the genetic distance between $M_g$ and $M_d$.

Figure 2.6 shows the expected length of the introgressed segment between $M_g$ and M1 for different backcross generations, as a function of $d_1$. The figure shows that accumulating meioses does not significantly reduce the size of the introgressed segment when gene-marker distances are small. In practice, a drastic reduction of the introgressed segment requires markers very close to the gene. In that case (small gene-marker distances in figure 2.6), the expected size of the introgressed segment is about half the gene-marker distance, whatever the generation at which the marker becomes of recipient genotype. To introduce small segments, the only solution is therefore to use markers very close to the target gene and to look, as early as possible, for individuals showing genotype RR at one or the other of the markers flanking the target gene.

2.2.2.2 Population sizes needed to obtain a recombinant genotype on each side of the gene

Hospital (2001) proposed a method for optimizing population sizes in order to obtain an individual showing recombinations on both sides of the target gene. The principle of the method is as follows:
- Set the risk of failure of the selection for reducing the size of the introgressed segment over the whole program.
- Set the latest generation at which the double-recombinant individual must be obtained ($t_{max}$).
- Compute the minimal population sizes that allow the double-recombinant individual to be obtained within $t_{max}$ backcross generations.

Fig. 2.6 – Expected size of the introgressed segment on one side of the target gene. X-axis: distance between the gene and the marker (cM). Solid lines: $\mu_{SI|M}$ for different backcross generations (denoted BC_t). Dotted lines: expected size of the introgressed segment without selection on markers, for a target gene located 100 cM from the end of the chromosome. After Hospital (2001).

Box 2: Computing population sizes

If the probability of obtaining an individual of the desired genotype in the progeny of a given cross is known, one can determine the number of progeny to produce in order to obtain, with risk α, at least one individual of the desired genotype. Let p be the probability of the event considered. The probability of observing no individual of the desired genotype among N is $(1 - p)^N$, which is set equal to the risk α. The minimal population size that satisfies the risk α follows:

$$N = \frac{\ln(\alpha)}{\ln(1 - p)} \tag{2.16}$$

Table 2.1 gives the minimal population sizes to use per generation in order to obtain an individual that is double recombinant on either side of the target gene with a failure probability of 0.01. These calculations were made with the program popmin (Hospital and Decoux 2002). The table gives results for one, two or three backcross generations. The same population size is assumed to be used at each generation. For each duration, the probabilities of success at each generation ($\beta_t$) are given. The sum of these probabilities equals the overall probability of success, 0.99. The mean number of individuals genotyped over the generations ($\bar{n}$) is also given. Formally, this number is computed as:

$$\bar{n} = \sum_{t=1}^{t_{max}} \beta_t \times t \times n_g \tag{2.17}$$

For example, with $t_{max} = 2$ and l = 5 cM, the probability of success as early as the first generation is $\beta_1 = 0.19$. In that case only $n_g = 184$ individuals need to be genotyped. If the double recombinant is not obtained, $n_g = 184$ additional individuals must be genotyped, and the double recombinant is then obtained with probability $\beta_2 = 0.80$. These results show that trying to obtain the desired individual in a single generation is extremely costly. At least two backcross generations are therefore necessary. In that case, typically, an individual recombinant on one side of the target gene is obtained at the first generation, and the recombination on the other side is obtained at the second generation. Planning three backcross generations also reduces costs effectively: the risk of failure is spread over more generations, so the corresponding population sizes are smaller. Moreover, the probability of success as early as the second generation is high. Thus, when 3 backcross generations are planned to reduce the size of the introgressed segment, it is very likely that only two will actually be needed.
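The formula of Box 2 (equation 2.16) is easy to apply directly. The Python sketch below is an illustration, not the popmin program: function names are hypothetical, and the probability of a double recombinant in a single generation is assumed to be $p = \frac{1}{2} r^2$ (target gene heterozygous with probability 1/2, one independent recombination on each side, both markers at the same distance l). Under this assumption, the sizes obtained for a risk of 0.01 coincide with the $t_{max} = 1$ column of Table 2.1.

```python
import math


def haldane_r(d):
    """Recombination rate for a distance of d Morgans (Haldane)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))


def min_population_size(p, alpha):
    """Eq. 2.16, rounded up to a whole number of individuals."""
    return math.ceil(math.log(alpha) / math.log(1.0 - p))


def p_double_recombinant_one_generation(l):
    """Assumed model: gene kept (1/2) and one recombination on each side."""
    r = haldane_r(l)
    return 0.5 * r * r


if __name__ == "__main__":
    sizes = [
        min_population_size(p_double_recombinant_one_generation(l_cm / 100.0), 0.01)
        for l_cm in (1, 5, 10, 20)
    ]
    print(sizes)  # -> [93959, 4066, 1119, 337]
```

The multi-generation columns of the table involve an optimization over generations and are not reproduced by this one-line formula.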


Tab. 2.1 – Minimal population sizes to obtain a double recombinant on either side of the target gene in one, two or three backcross generations. Both markers are at distance l from the target gene. After Hospital (2001).

tmax = 1
l (cM)   µSI|M   ng
1.0      1.0     93 959
5.0      5.0     4 066
10.0     10.0    1 119
20.0     19.9    337

tmax = 2
l (cM)   µSI|M   ng     β1     β2     n̄
1.0      1.0     921    0.04   0.95   1800.9
5.0      5.0     184    0.19   0.80   333.0
10.0     9.8     92     0.32   0.67   154.7
20.0     19.2    47     0.47   0.52   71.5

tmax = 3
l (cM)   µSI|M   ng     β1     β2     β3     n̄
1.0      1.0     471    0.02   0.87   0.09   975.2
5.0      4.9     96     0.10   0.80   0.09   975.2
10.0     9.6     49     0.18   0.72   0.08   93.1
20.0     18.4    26     0.30   0.62   0.07   46.1

2.2.3 Use of markers to accelerate the return to the recurrent parent on non-carrier chromosomes

The marker genotype of an individual provides an estimate of its genetic composition. Based on this estimate, the individuals with the largest proportion of recipient genome in their genetic background can be selected. This part shows how to estimate the genetic composition of individuals from their marker genotype and how to define a marker ideotype from a target rate of return to the recurrent parent.

2.2.3.1 Estimating the genetic composition of individuals from their marker genotype

The genetic composition of an individual with a known marker genotype can be estimated with an approach similar to that of Hill (1993), presented in section 1.2.2. Using similar notation, we call $Z_{i|M}^{(t)}$ an indicator variable for the origin of a locus $i$ in the genome of an individual, given its marker genotype $M$ at generation $t$, such that $Z_{i|M}^{(t)} = 1$ if locus $i$ comes from the recipient (recurrent) parent and $Z_{i|M}^{(t)} = 0$ if it comes from the donor parent. The expectation of $Z_{i|M}^{(t)}$ is:

$$
E\left(Z_{i|M}^{(t)}\right) = 1 \times P\left(Z_{i|M}^{(t)} = 1\right) + 0 \times P\left(Z_{i|M}^{(t)} = 0\right) = P\left(Z_{i|M}^{(t)} = 1\right) \tag{2.18}
$$

where $P(Z_{i|M}^{(t)} = 1)$ is the probability that locus $i$ comes from the recurrent parent, given the marker genotype.

The genetic composition of an individual carrying marker genotype $M$ is $\bar{Z}_M^{(t)}$. Considering $k$ loci on a chromosome, we have:

$$
\bar{Z}_M^{(t)} = \frac{1}{k} \sum_{i=1}^{k} Z_{i|M}^{(t)} \tag{2.19}
$$

Thus, the expected proportion of recipient genome in the genetic background of an individual of marker genotype $M$ is:

$$
E\left(\bar{Z}_M^{(t)}\right) = E\left(\frac{1}{k} \sum_{i=1}^{k} Z_{i|M}^{(t)}\right) = \frac{1}{k} \sum_{i=1}^{k} P\left(Z_{i|M}^{(t)} = 1\right) \tag{2.20}
$$

Considering a continuous distribution of loci $i$ at positions $z_i$ on a chromosome of length $L$, we can write:

$$
E\left(\bar{Z}_M^{(t)}\right) = \frac{1}{L} \int_0^L P\left(Z_{i|M}^{(t)} = 1\right) dz_i \tag{2.21}
$$
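To make equation 2.21 concrete, here is a minimal numerical sketch restricted to the simplest case: generation BC1, a marker genotype that is recipient at every marker, and no crossover interference (Haldane mapping function). Under these assumptions, which are ours and much narrower than what the MDM program handles, the probability that a locus is of recipient origin depends only on its nearest flanking markers. The integral is then approximated with the trapezoidal rule. All function names are illustrative.

```python
import math


def haldane_r(d_cm):
    """Recombination fraction for a map distance in cM (Haldane, no interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))


def p_recipient_bc1(z, markers):
    """P(locus at z is of recipient origin | all markers recipient), BC1 only."""
    left = [m for m in markers if m <= z]
    right = [m for m in markers if m >= z]
    if left and right:
        a, b = max(left), min(right)
        if a == b:  # the locus sits on a marker
            return 1.0
        # no recombination on either side, given none between the two markers
        return ((1 - haldane_r(z - a)) * (1 - haldane_r(b - z))
                / (1 - haldane_r(b - a)))
    # locus between a telomere and the outermost marker
    nearest = max(left) if left else min(right)
    return 1 - haldane_r(abs(z - nearest))


def expected_recipient_share(markers, length_cm, steps=2000):
    """Trapezoidal approximation of (1/L) * integral over [0, L] of P(Z = 1) dz."""
    h = length_cm / steps
    values = [p_recipient_bc1(i * h, markers) for i in range(steps + 1)]
    area = h * (sum(values) - 0.5 * (values[0] + values[-1]))
    return area / length_cm


print(expected_recipient_share([50, 150], 200))
```

Later generations require the genotype probabilities conditional on the markers over several meioses, which this sketch does not attempt.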

Similarly, the variance of the genetic composition of individuals carrying the same marker genotype $M$ is:

$$
Var\left(\bar{Z}_M^{(t)}\right) = Var\left(\frac{1}{k} \sum_{i=1}^{k} Z_{i|M}^{(t)}\right) = \frac{1}{k^2} \sum_{i=1}^{k} \sum_{j=1}^{k} \left[ E\left(Z_{i|M}^{(t)} Z_{j|M}^{(t)}\right) - E\left(Z_{i|M}^{(t)}\right) E\left(Z_{j|M}^{(t)}\right) \right] \tag{2.22}
$$

or, considering a continuous distribution of loci $i$ and $j$ at positions $z_i$ and $z_j$ on the chromosome:

$$
Var\left(\bar{Z}_M^{(t)}\right) = \frac{1}{L^2} \int_0^L \int_0^L \left[ P\left(Z_{i|M}^{(t)} = 1,\ Z_{j|M}^{(t)} = 1\right) - P\left(Z_{i|M}^{(t)} = 1\right) P\left(Z_{j|M}^{(t)} = 1\right) \right] dz_i \, dz_j \tag{2.23}
$$

where $P(Z_{i|M}^{(t)} = 1,\ Z_{j|M}^{(t)} = 1)$ is the joint probability that loci $i$ and $j$ are both of genotype RR. In this equation the indices $i$ and $j$ may denote the same locus (when $z_i = z_j$), so that the variance at each position is included in the sum.

The probabilities of the possible genotypes at the loci of a chromosome, conditional on the marker genotype of an individual, can be computed with the program MDM (Servin et al. 2002). Equations 2.21 and 2.23 can be evaluated numerically with the program bcdopt, available at http://moulon.inra.fr/~servin, which uses the computation routines of MDM.

2.2.3.2 Definition of a marker ideotype on non-carrier chromosomes

The marker ideotype on non-carrier chromosomes is genotype RR at all markers. Simulation studies (e.g. Hospital et al. (1992), Visscher et al. (1996)) show that selection on markers readily achieves this ideotype. Optimizing selection on non-carrier chromosomes therefore amounts to choosing the number and positions of the markers so that the actual proportion of recipient genome (over the whole genetic background) in individuals of genotype RR at all markers matches the target rate of return to the recurrent parent.

The optimal positions of the markers on a non-carrier chromosome can be described by a single parameter d, the distance between the "left" telomere of the chromosome and the "leftmost" marker placed on the chromosome, as shown by Visscher (1996) and Servin and Hospital (2002). The positions of the other markers on the chromosome are completely determined by this distance d, as shown in figure 2.7.

[Figure: chromosome T1 – M1 – M2 – ... – Mm−1 – Mm – T2 of total length L; the segments T1–M1 and Mm–T2 have length d, and adjacent markers are (L − 2d)/(m − 1) apart.]

Fig. 2.7 – Positions of m markers (M1 ... Mm) on a chromosome of length L. The parameter describing these positions is the distance d between the "left" telomere T1 and the "leftmost" marker (M1). Symmetrically, marker Mm lies d centiMorgans from telomere T2. The other markers are evenly spaced between M1 and Mm.

The optimal marker positions are then found by maximizing the proportion of recipient genome on non-carrier chromosomes, i.e. by maximizing $E(\bar{Z}_M^{(t)})$ (equation 2.21), computed for a marker genotype M that is entirely homozygous recipient. This maximum depends on the number of markers placed on the chromosome and on the backcross generation at which the fully RR marker genotype is obtained. These optimal positions were determined by Servin and Hospital (2002). Table 2.2 gives the optimal positions (d*) of the markers on a non-carrier chromosome of 200 centiMorgans, the estimated proportions of recipient genome on the chromosome (Π) and the standard deviation of Π (SDΠ).
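The layout of figure 2.7 reduces to a short formula: marker j (counting from 0) sits at d + j (L − 2d)/(m − 1). The sketch below is a direct transcription of that layout. The function name and the input checks are illustrative additions, not part of the original method.

```python
def marker_positions(d, m, length):
    """Positions (cM) of m markers: first at d, last at length - d, evenly spaced."""
    if m < 2:
        raise ValueError("this layout needs at least two markers")
    if not 0 <= d <= length / 2:
        raise ValueError("d must lie between 0 and half the chromosome length")
    spacing = (length - 2 * d) / (m - 1)
    return [d + j * spacing for j in range(m)]


# Three markers on a 200 cM chromosome with d = 32 cM
print(marker_positions(32, 3, 200))  # [32.0, 100.0, 168.0]
```

For an odd number of markers the middle marker always falls at the centre of the chromosome, whatever the value of d.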

          m = 2                m = 3                m = 4                m = 5
 t     d*     Π    SDΠ      d*     Π    SDΠ      d*     Π    SDΠ      d*     Π    SDΠ
 2     50   95.2   6.1      29   97.0   4.4      19   98.0   3.2      13   98.6   2.5
 3     52   97.2   4.7      32   98.1   3.5      21   98.7   2.7      15   99.0   2.1
 4     54   98.4   3.4      34   98.8   2.7      23   99.1   2.1      16   99.4   1.7
 5     55   99.1   2.5      35   99.3   2.0      25   99.5   1.6      18   99.6   1.3

Tab. 2.2 – Optimal positions (d*, in cM) of 2, 3, 4 and 5 markers (m), expected rate of return (Π, in %) and its standard deviation (SDΠ) on a non-carrier chromosome of length 200 centiMorgans, at different backcross generations t. After Servin and Hospital (2002) and Servin (in prep.).

The results in table 2.2 show that the more markers are used, the larger the expected proportion (Π) of recipient genome in individuals carrying the marker ideotype. At the same time, the variance of this proportion is smaller. Clearly, using more markers per chromosome gives a better estimate of the genetic composition of individuals. However, using many markers per chromosome implies higher genotyping costs (see elsewhere). We must therefore determine the minimal (i.e. necessary and sufficient) number of markers to use per non-carrier chromosome that guarantees an individual whose actual genetic composition is at least equal to our target. To do so, the estimate of genetic composition given by the markers (Π, table 2.2) must first be at least equal to the target. But we must also make sure that the actual genetic composition of the final individual does not deviate too much from the mean value, otherwise we risk missing the target. With these considerations in mind, we now describe an approach for choosing the number of markers to use per non-carrier chromosome.

At the end of a marker-assisted backcross program, we obtain a number of individuals carrying the marker ideotype, among which the best must be selected, i.e. the one with the largest proportion of recipient genome. Note that, to discriminate between these individuals, one can either (i) genotype them for more markers on the genetic background or (ii) evaluate their agronomic performance. The cost of this selection depends on the number of individuals to be screened. To limit this cost, we must estimate the minimum number of individuals carrying the marker ideotype that must be produced to be sure (with a fixed risk of failure) that the best of them has a proportion of recipient genome at least equal to the target. Formally, this number is computed from the distribution function of $\bar{Z}_M^{(t)}$ (see box 3).

Fig. 2.8 – Distribution function of the proportion of recipient genome (Π) on a non-carrier chromosome of genotype RR at 3 markers, at generation BC3.


Box 3. Computing the minimal number $N_{I^*}$ of individuals carrying the marker ideotype $I^*$ on $K$ non-carrier chromosomes that ensures a fixed rate of return $\gamma$

Principle. We denote by $P(\Pi \geq \gamma \mid I^*)$ the probability that an individual carrying the marker ideotype has an actual proportion of recipient genome (over the whole genetic background) greater than or equal to $\gamma$. We assume that we want the segments of donor genome remaining in the genetic background to be spread evenly over all chromosomes. That is, writing $P(\Pi_k \geq x \mid I^*)$ for the probability that chromosome $k$ has a proportion of recipient genome greater than $x$, we compute $P(\Pi \geq \gamma \mid I^*)$ as:

$$
P(\Pi \geq \gamma \mid I^*) = \prod_{k=1}^{K} P(\Pi_k \geq \gamma \mid I^*) \tag{2.24}
$$

Knowing $P(\Pi \geq \gamma \mid I^*)$, we can compute $N_{I^*}$ with the method presented in box 2.

Computing $P(\Pi \geq \gamma \mid I^*)$. This probability is determined by the distribution function of $\bar{Z}_M^{(t)}$, which is not known. It can be estimated by simulation (Servin, in prep.). The simulations reproduce the backcross process on non-carrier chromosomes carrying markers and genetic background loci. The simulation is stopped after a fixed number of backcross generations, and only the chromosomes that are entirely recipient at the markers are kept. Since the genotypes of the background loci can be observed in these simulations, the genetic composition of the retained chromosomes can be estimated. We thus obtain a sample drawn from the distribution of $\bar{Z}_M^{(t)}$, from which the desired distribution function can be estimated. Estimating an unknown distribution function by drawing a sample from the distribution in this way is called Monte Carlo integration.

For example, from the distribution function of figure 2.8, we find that the probability that an individual carrying a fully recipient genotype at 3 markers on a chromosome of 200 centiMorgans at generation BC3 has an actual proportion of recipient genome (over the whole chromosome) greater than 99% is 0.84. Considering not one but ten non-carrier chromosomes of the same genotype and the same size, the probability that the actual proportion of recipient genome over these 10 chromosomes exceeds 99% is therefore $(0.84)^{10} = 0.175$.
We can then determine that 24 individuals carrying the marker ideotype must be produced to have a 99% chance that one of them actually has a rate of return to the recurrent parent greater than 99% (see box 2).

The choice of the number of markers to use on non-carrier chromosomes must therefore take into account the genotyping costs both during the backcross program and at the final selection step, when the best possible individual is sought. We will look at an example of this optimization in more detail in section 2.3.2.
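The procedure of box 3 can be sketched end to end as follows. This is an illustrative reconstruction under stated assumptions, not the simulation code used in the thesis: the chromosome is discretized on a 1 cM grid, crossovers follow a Poisson process without interference, no selection is applied before the final generation, and the final step assumes that box 2 uses the usual "at least one success among N independent individuals" calculation. All function names are hypothetical.

```python
import math
import random

CM_PER_MORGAN = 100.0


def meiosis(donor, step_cm, rng):
    """One backcross meiosis of a mosaic homolog paired with a pure recipient one.

    A grid point stays donor only if it was donor before and the gamete
    copies the mosaic homolog at that point.
    """
    from_mosaic = rng.random() < 0.5
    next_crossover = rng.expovariate(1.0 / CM_PER_MORGAN)
    gamete = []
    for i, is_donor in enumerate(donor):
        while next_crossover < i * step_cm:
            from_mosaic = not from_mosaic
            next_crossover += rng.expovariate(1.0 / CM_PER_MORGAN)
        gamete.append(is_donor and from_mosaic)
    return gamete


def sample_recipient_share(markers_cm, length_cm, generations, n_keep,
                           grid=201, seed=0):
    """Recipient share of chromosomes that are recipient at every marker."""
    rng = random.Random(seed)
    step_cm = length_cm / (grid - 1)
    marker_idx = [round(pos / step_cm) for pos in markers_cm]
    kept = []
    while len(kept) < n_keep:
        donor = [True] * grid  # donor homolog of the F1
        for _ in range(generations):
            donor = meiosis(donor, step_cm, rng)
        if not any(donor[i] for i in marker_idx):  # marker ideotype reached
            kept.append(1.0 - sum(donor) / grid)
    return kept


def prob_at_least(shares, gamma):
    """Empirical P(share >= gamma) from a Monte Carlo sample."""
    return sum(s >= gamma for s in shares) / len(shares)


def min_individuals(p_success, risk):
    """Smallest N such that 1 - (1 - p_success)**N >= 1 - risk."""
    if not 0.0 < p_success < 1.0:
        raise ValueError("p_success must lie strictly between 0 and 1")
    return math.ceil(math.log(risk) / math.log(1.0 - p_success))


shares = sample_recipient_share([32, 100, 168], 200, generations=3, n_keep=1000)
print(prob_at_least(shares, 0.99))         # Monte Carlo estimate, one chromosome
print(min_individuals(0.84 ** 10, 0.01))   # 24, the value quoted in the text
```

The last line reproduces the figure of 24 individuals from the probabilities quoted in the text. The Monte Carlo estimate itself depends on the grid, the seed and the modelling assumptions above, so it should not be expected to match figure 2.8 exactly.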

2.2.4 Partial conclusion

From the theoretical results presented above, we have seen that markers can be expected to increase the efficiency of backcrossing and to fulfil all the selection objectives.

Selection on the target gene. Markers identify the individuals that are heterozygous for the target gene. The relative gain over classical backcrossing depends on the cost of phenotyping the individuals. It is particularly large for recessive genes or QTL.

Selection to reduce the size of the introgressed segment. Markers identify the individuals carrying favourable recombination events and drastically reduce the size of the introgressed segments around target genes. This is a qualitative improvement of backcross products, since such a reduction is almost never ensured when individuals are selected on their phenotypes.

Selection for return to the recipient genetic background. Markers make it possible to estimate the genetic composition of the individuals in a population produced by backcross. They therefore allow very fine selection on the return to the recurrent parent, with the prospect of significantly reducing the time needed to obtain satisfactory genotypes.

2.3 Optimizing selection to obtain the marker ideotype at minimal cost

2.3.1 Optimal genotyping strategy

In a marker-assisted selection program, it is important to assess beforehand how much genotyping will be needed to obtain the marker ideotype. A well-chosen strategy for genotyping the population can keep genotyping costs low. Within a population produced by backcross, individuals should be selected preferentially for being (i) heterozygous at the target gene, (ii) recombinant on either side of the gene and (iii) homozygous recipient at all the markers located on non-carrier chromosomes. This order of preference should generally be kept in the genotyping strategy of a BCt population in order to reduce genotyping costs.

Given the three selection objectives above, three corresponding subsets of markers can be distinguished: the markers controlling the target gene (Mgc), the markers flanking the gene, used to reduce the size of the introgressed segment (MSI), and the markers spread over the rest of the genome, used to accelerate the return to the recurrent parent (MFG).

[Figure: a population of size Nt is reduced to SubPop 1, SubPop 2 and SubPop 3 by successive selection on Mgc, MSI and MFG, followed by a backcross to the recipient parent.]

Fig. 2.9 – General scheme of a stepwise genotyping strategy.

The genotyping strategy that proves optimal in simulations of backcross programs (figure 2.9) is described below. If Nt denotes the size of the population under study, the following steps are carried out in sequence at each generation:

1. Genotype the whole population for the markers controlling the gene (Mgc). On average, Nt/2 individuals carry the gene in the heterozygous state.
2. Genotype these Nt/2 individuals for the markers flanking the gene (MSI) to identify the few recombinants, either on both sides of the gene at once or on one side only. Selection intensity is very high at this step, since few recombinants (nt) will be available for the next step.
3. Genotype the nt recombinants around the introgressed gene at the MFG markers and keep the one with the largest proportion of markers of recipient genotype.

Most of the genotyping cost lies in steps (1) and (2): although few markers are genotyped, many individuals must be. In step (3), more markers are genotyped than in steps (1) and (2), but only a few individuals (nt). Hospital et al. (1992) proposed to select on the non-carrier chromosomes only from advanced backcross generations onwards, because the return to the recurrent parent progresses even without selection on the genetic background. However, genotyping from the first generations, even a small number of individuals, makes it possible to detect the markers that are already fixed for the recipient allele. Besides a small gain in efficiency on the return to the recurrent parent, genotyping from the first generations therefore greatly reduces cost by decreasing the number of markers to be typed in the last generations. It is thus important always to carry out step (3), even if only one individual is selected at the end of step (2).
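The saving brought by the stepwise strategy can be illustrated with a simple count of marker data points for one generation. The sketch below assumes that one marker data point corresponds to one marker typed on one individual, and that exactly half of the population is heterozygous at the target gene. The function names and the numbers in the example are hypothetical and are not taken from the simulations of this chapter.

```python
def stepwise_mdp(pop_size, n_gene_markers, n_flanking_markers,
                 n_recombinants, n_background_markers):
    """Expected marker data points for one generation of the three-step strategy."""
    step1 = pop_size * n_gene_markers            # everyone, gene control markers
    step2 = (pop_size / 2) * n_flanking_markers  # heterozygous carriers only
    step3 = n_recombinants * n_background_markers
    return step1 + step2 + step3


def full_mdp(pop_size, n_gene_markers, n_flanking_markers, n_background_markers):
    """Marker data points if every individual were typed at every marker."""
    return pop_size * (n_gene_markers + n_flanking_markers + n_background_markers)


# Hypothetical generation: 100 individuals, 1 control marker, 2 flanking
# markers, 3 recombinants retained, 20 background markers.
print(stepwise_mdp(100, 1, 2, 3, 20))  # 260.0
print(full_mdp(100, 1, 2, 20))         # 2300
```

The count does not model the further saving described above, where markers already fixed for the recipient allele no longer need to be typed in later generations.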


2.3.2 Illustration of the approach in two case studies

In this section we study two introgression cases that are representative of real gene introgression programs:

Cultivated type. Target rate of return: 97%; acceptable length of the introgressed segment around the target gene: 10 centiMorgans. In this case the desired return to the recurrent parent is moderate. This is typically the selection objective for the introgression of a gene from a donor parent of good agronomic quality.

Wild type. Target rate of return: 99%; length of the introgressed segment: 2 centiMorgans. In this case the target return to the recurrent parent is very high. This is typically the selection objective for the introgression of a gene from a donor parent of very poor agronomic quality, for example a wild variety donating a disease resistance gene.

To determine the best strategies for meeting the selection objective in each case, we follow the approach proposed in section 2.1. We first determine the marker ideotype we want to obtain, then simulate backcross programs leading to this ideotype. For simplicity we assume that, in each case, population size is constant across generations. With our simulations we therefore look for the minimal population size used at each generation, Ng, that allows the ideotype to be obtained. In our simulations, individuals are selected on their marker genotype only. The results for each simulation are means over 1000 replicates with the same parameters. We consider a backcross program complete when the marker ideotype is obtained in 99% of the replicates. The genotyping strategy used is the one presented in section 2.3.

In both cases, we consider a genome of 10 chromosomes of 200 centiMorgans each. The target gene to be introgressed is located at the centre of one of the chromosomes and is controlled by an intragenic marker.

2.3.2.1 Strategies for meeting the objectives of the Cultivated type

By comparing the results of table 2.2 with our selection objectives, we first restrict the parameter space to explore in our simulations. Here the target rate of return to the recurrent parent is 97%. The theoretical expectations in table 2.2 show that the marker ideotype of non-carrier chromosomes must carry at least 2 markers per chromosome and that at least three generations of backcross are needed. Moreover, it does not seem necessary to place more than 3 markers per non-carrier chromosome or to carry out more than 4 generations of backcross, because the expected rates of return are then well beyond our target and would imply too high a cost given that target. In the light of these theoretical expectations, we decide a priori to study introgression strategies lasting 2 to 4 generations of backcross and using 2 to 3 markers per non-carrier chromosome. We can then define our marker ideotype:

Marker-gene distance (LD)                          : 10 cM
Number of markers per non-carrier chromosome       : 2 to 3
Marker positions on non-carrier chromosomes
    2 markers per non-carrier chromosome           : d = 50 cM
    3 markers per non-carrier chromosome           : d = 30 cM
Duration of introgression programs                 : 2 to 4 generations

Tab. 2.3 – Parameters used in the simulations of cultivated-type introgression programs

Definition of the marker ideotype
– Reduction of the size of the introgressed segment: following the conclusions of Hospital (2001), we use markers placed 10 centiMorgans on each side of the target gene.
– Return to the recurrent parent on non-carrier chromosomes: to reduce the number of parameters involved in the simulations, the markers were placed at the same positions for all the program durations studied, although the computed optimal positions vary across generations. The positions used are given in table 2.3.

Minimal population sizes to obtain the marker ideotype
Once the marker ideotype is defined, we look for the minimal population sizes needed to obtain it after 2, 3 or 4 generations of backcross. To determine this value, we ran simulations of backcross programs under the conditions given in table 2.3. We considered population sizes of up to 500 individuals. Under this constraint, it is impossible to obtain the marker ideotype in 2 generations, whatever the number of markers per non-carrier chromosome studied. We therefore restrict the strategies studied to those lasting 3 or 4 generations.

 m    t    tLD     Π      Ns    NI*     Ng      MDP
 2    3     2     97.0    18     89    100     819.6
 2    4     3     98.3    17     20     40     409.7
 3    3     2     97.8    19     60    150    1534.8
 3    4     3     98.7    14     15     40     504.2

Tab. 2.4 – Introgression strategies meeting the objectives of a cultivated-type program (see text for details), with minimal population sizes Ng

Table 2.4 presents the 4 best strategies in the parameter space studied. For each strategy, the following characteristic values are given:
– The generation tLD at which a double recombinant individual around the target gene is obtained.
– The mean proportion Π of recipient genome in the individuals obtained at the end of the backcross program.
– The number Ns of individuals carrying the marker ideotype at the end of the backcross program.
– The theoretical number NI* of individuals carrying the marker ideotype that must be produced to obtain at least one individual with more than 97% of recipient genome in its genetic background, with a probability of failure of 0.05. NI* is computed with the method presented in box 3.
– The population size Ng to use at each generation.
– The genotyping cost, expressed in Marker Data Points (MDP).

The strategies obtained that satisfy the return objectives of the cultivated type are:

m = 2 and t = 3. The selection objectives are met on average over all the simulations. Here tLD = 2, so this is typically a two-stage strategy: the first two generations are devoted to obtaining the double recombinant on either side of the target gene, and the last one to completing selection on the genetic background. Note, however, that the mean rate of return is exactly equal to the target. Given the difference between Ns and NI* in this case, the 18 individuals carrying the marker ideotype at the end of the program are not enough to be certain of obtaining an individual that meets the target rate of return to the recurrent parent.

m = 2 and t = 4. The mean rate of return obtained with this strategy (98.3%) exceeds the target. Comparing Ns and NI* shows that it is fairly likely that one of the 17 individuals obtained at the end of the program meets the target of 97% return to the recurrent parent. In this strategy the double recombinant is obtained at the third generation, so during these generations selection on the genetic background operates in parallel with selection for the double recombinant. This strategy has the lowest genotyping cost of all, and the population size required at each generation is small (40 individuals).

m = 3 and t = 3. The mean rate of return obtained with this strategy (97.8%) is slightly above the target. However, the number of individuals carrying the marker ideotype is much smaller than NI*, so the probability of obtaining an individual that meets the selection objectives is low.

m = 3 and t = 4. The mean rate of return obtained with this strategy (98.7%) is well above the target. Moreover, the number of individuals carrying the marker ideotype is very close to NI*, so the probability of obtaining an individual that meets the selection objectives is high.

The study of these strategies shows that at least four generations of backcross are needed to meet the objectives of the cultivated type. Although the mean rates of return after three generations are close to the target, the variances are large and the risk of failure of these strategies is therefore high.
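The qualitative comparison of Ns and NI* above can be given a rough numerical reading. If NI* is the number of individuals for which the probability of failure equals the risk (here 0.05), and individuals are independent, then each individual fails with probability risk^(1/NI*), and Ns individuals succeed with probability 1 − risk^(Ns/NI*). This back-of-the-envelope relation is our own approximation: it ignores the rounding of NI* to an integer and is not a result of the simulations.

```python
def success_probability(n_s, n_i_star, risk=0.05):
    """Approximate chance that at least one of n_s ideotype carriers meets the
    target, given that n_i_star carriers would leave a failure probability `risk`."""
    return 1.0 - risk ** (n_s / n_i_star)


# (m, t): (Ns, NI*) taken from Table 2.4
table_2_4 = {
    (2, 3): (18, 89),
    (2, 4): (17, 20),
    (3, 3): (19, 60),
    (3, 4): (14, 15),
}

for (m, t), (n_s, n_i_star) in table_2_4.items():
    print(f"m={m}, t={t}: {success_probability(n_s, n_i_star):.2f}")
```

Under this approximation the two four-generation strategies reach a success probability above 0.9, while the two three-generation strategies stay well below it, which is in line with the conclusions drawn from the table.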

Marker-gene distance (LD)                          : 2 cM
Number of markers per non-carrier chromosome       : 2 to 5
Marker positions on non-carrier chromosomes
    2 markers per non-carrier chromosome           : d = 50 cM
    3 markers per non-carrier chromosome           : d = 30 cM
    4 markers per non-carrier chromosome           : d = 20 cM
    5 markers per non-carrier chromosome           : d = 15 cM
Duration of introgression programs                 : 3 to 5 generations

Tab. 2.5 – Parameters used in the simulations of wild-type introgression programs

2.3.2.2 Strategies for meeting the objectives of the Wild type

We again use table 2.2 to restrict the parameter space explored by simulation. Given our selection objectives, a fairly wide range of possibilities must be explored. We must explore the strategies:
– using 2 or 3 markers per non-carrier chromosome and lasting 5 generations;
– using 4 markers per non-carrier chromosome and lasting 4 or 5 generations;
– using 5 markers per non-carrier chromosome and lasting 3 to 5 generations.

Definition of the marker ideotype
– Reduction of the size of the introgressed segment: following the conclusions of Hospital (2001), we use markers placed 2 centiMorgans on each side of the target gene.
– Return to the recurrent parent on non-carrier chromosomes: as in the previous case, we place the markers at the same positions whatever the duration of the strategy studied. The positions used are given in table 2.5.

Minimal population sizes to obtain the marker ideotype
Under the constraint of a maximal population size of 500 individuals, it is impossible to obtain the ideotype with 5 markers per non-carrier chromosome in 3 generations. Table 2.6 presents the results obtained for the other possible configurations. The quantities in the table have the same meaning as in the cultivated case.


 m    t    tLD     Π      Ns    NI*     Ng     MDP
 2    5     5     99.2    64     25    130    1374
 3    5     5     99.4    64     21    130    1591
 4    4     4     99.2    84     30    170    2054
 4    5     5     99.5    64     13    130    1912
 5    4     4     99.3    52     30    170    2330
 5    5     5     99.6    63     12    130    2195

Tab. 2.6 – Introgression strategies meeting the objectives of a wild-type program (see text for details), with minimal population sizes Ng

Strategies of the same duration require the same population size Ng. Yet strategies of the same duration differ only in the number of markers on non-carrier chromosomes. The probability of obtaining the marker ideotype on non-carrier chromosomes is therefore not limiting for these strategies. Instead, in all the strategies studied, the limiting factor is obtaining an individual that is double recombinant on either side of the gene. We saw in section 2.2.2.2 how to compute the minimal population sizes needed to obtain the double recombinant after t generations with the program popmin (Hospital and Decoux 2002). The sizes given by popmin are ng = 162 and ng = 123 to obtain the individual in 4 and 5 generations of backcross respectively, which is consistent with the minimal population sizes Ng found by simulation. The limiting factor is therefore indeed obtaining the double recombinant genotype on either side of the target gene. Thus the minimal population sizes computed by popmin are also sufficient to obtain the marker ideotype of non-carrier chromosomes.

In all the strategies studied, the mean rate of return exceeds 99%, as expected from the theoretical rates of return of table 2.2. Moreover, for all strategies the number Ns of individuals carrying the marker ideotype exceeds NI*, which indicates that an individual satisfying all the selection criteria is very likely to be found. In the end, what differentiates the strategies is the genotyping cost (MDP). The most expensive strategies are those lasting 4 generations. This is explained by the large minimal population size Ng at each generation. In addition, the number NI* of individuals to produce to meet the selection objectives in 4 generations is larger than for the strategies lasting 5 generations. These strategies will therefore be more expensive than those lasting 5 generations. If short programs are to be favoured, the ideal strategy is to use 5 markers per non-carrier chromosome, whereas if the aim is to minimize genotyping costs, 5 generations of backcross should be carried out with 4 markers per non-carrier chromosome.
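The first step used for both case studies, screening table 2.2 to restrict the parameter space, can be written as a simple filter. The sketch below copies the expected rates of return Π from table 2.2 and lists, for a given target, the generations t at which each number of markers m reaches it. The function name is illustrative.

```python
# Expected rate of return (%) from Table 2.2, indexed by (m, t)
EXPECTED_RETURN = {
    (2, 2): 95.2, (2, 3): 97.2, (2, 4): 98.4, (2, 5): 99.1,
    (3, 2): 97.0, (3, 3): 98.1, (3, 4): 98.8, (3, 5): 99.3,
    (4, 2): 98.0, (4, 3): 98.7, (4, 4): 99.1, (4, 5): 99.5,
    (5, 2): 98.6, (5, 3): 99.0, (5, 4): 99.4, (5, 5): 99.6,
}


def candidate_strategies(target):
    """Map each marker number m to the generations t with expected return >= target."""
    result = {}
    for (m, t), expected in sorted(EXPECTED_RETURN.items()):
        if expected >= target:
            result.setdefault(m, []).append(t)
    return result


print(candidate_strategies(99.0))
# {2: [5], 3: [5], 4: [4, 5], 5: [3, 4, 5]}
```

For a 99% target this returns exactly the list of strategies explored for the wild type. For the cultivated type (97%) the filter alone is more permissive than the text, which additionally caps the search at 3 markers and 4 generations to avoid overshooting the target at unnecessary cost.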


References

Haldane, J., 1919 The combination of linkage values and the calculation of distances between the loci and the linked factors. Journal of Genetics 8: 299–309.

Hanson, W. D., 1959 Early generation analysis of lengths of heterozygous chromosome segments around a locus held heterozygous with backcrossing or selfing. Genetics 44: 833–837.

Hill, W., 1993 Variation in the genetic composition in backcrossing programs. Journal of Heredity 84: 212–213.

Hospital, F., 2001 Size of donor chromosome segments around introgressed loci and reduction of linkage drag in marker-assisted backcross programs. Genetics 158(3): 1363–1379.

Hospital, F. and A. Charcosset, 1997 Marker-assisted introgression of quantitative trait loci. Genetics 147(3): 1469–1485.

Hospital, F., C. Chevalet, and P. Mulsant, 1992 Using markers in gene introgression breeding programs. Genetics 132(4): 1199–1210.

Hospital, F. and G. Decoux, 2002 Popmin: a program for the numerical optimization of population sizes in marker-assisted backcross programs. Journal of Heredity 93(5): 383–384.

Melchinger, A., 1990 Use of molecular markers in breeding for oligogenic disease resistance. Plant Breeding 104: 1–19.

Naveira, H. and A. Barbadilla, 1992 The theoretical distribution of lengths of intact chromosome segments around a locus held heterozygous with backcrossing in a diploid species. Genetics 130: 205–209.

Servin, B., C. Dillmann, G. Decoux, and F. Hospital, 2002 Mdm: a program to compute fully informative genotype frequencies in complex breeding schemes. Journal of Heredity 93(3): 227–228.

Servin, B. and F. Hospital, 2002 Optimal positioning of markers to control genetic background in marker-assisted backcrossing. Journal of Heredity 93(3): 214–217.

Stam, P. and A. Zeven, 1981 The theoretical proportion of the donor genome in near-isogenic lines of self-fertilizers bred by backcrossing. Euphytica 30: 227–238.

Visscher, P., 1996 Proportion of the variation in genomic composition in backcrossing explained by genetic markers. Journal of Heredity 87: 136–138.

Visscher, P., C. Haley, and R. Thompson, 1996 Marker-assisted introgression in backcross breeding programs. Genetics 144(4): 1923–1932.

Young, N. and S. D. Tanksley, 1989 RFLP analysis of the size of chromosomal segments retained around the Tm-2 locus of tomato during backcross breeding. Theoretical and Applied Genetics 77: 353–359.

39

Optimal Positioning of Markers to Control Genetic Background in Marker-Assisted Backcrossing

B. Servin and F. Hospital

The Journal of Heredity 2002:93(3):214–217

Molecular markers are commonly used in backcross breeding programs in plants. As genetic maps contain more and more markers, it is of interest to determine which markers are to be used for selection. Here we describe how one can compute an optimal positioning of markers resulting in a maximization of the expected proportion of recipient genome. This criterion allows us to take selection into account and to produce relevant results regarding the final efficiency of background selection in backcross programs.

Molecular markers have proven very useful in improving backcross breeding schemes. Particularly, markers allow us to estimate the genomic composition of individuals, and selection on markers can speed up the recipient genome recovery on noncarrier chromosomes (background selection). Several studies have shown that few markers (typically 2–4 markers/Morgan) are necessary to control genetic background in marker-assisted backcrossing (Hospital et al. 1992; Visscher et al. 1996). Yet more than 2–4 markers/Morgan are generally available. If we assume that all markers on the genetic map have the same technical benefits (codominance, polymorphism between parents), the choice of the markers to use for background selection has to be made according to their positions on the genetic map.

Few studies have evaluated the optimal positioning of markers to improve background selection efficiency. Hospital et al. (1992) showed the impact of marker positions on background selection efficiency based on simulation studies, and determined roughly the optimal positioning of two markers on a chromosome of 100 cM. Visscher (1996) computed the optimal positioning of markers, defined as the positions for which markers best explained the variation in genomic composition of the chromosomes. In Visscher (1996), the proportion of variance explained by markers is derived analytically, based on previous calculations from Hill (1993), under the assumption of no background selection on markers. The idea is that markers that explain most of the variation prior to selection would be the most efficient to select for. However, this ignores the effects of selection on markers over successive generations. Note that it is widely acknowledged in population and quantitative genetics that such effects of selection are barely amenable analytically.

We suggest here a different approach to determine the optimal positioning of markers, taking the effects of selection into account. The aim of a backcross selection program is that any locus but the gene introgressed from the donor line eventually returns to a homozygous recipient type. Even without background selection on markers, this is just a matter of time (i.e., of the number of backcross generations). The aim of selection on markers is to go faster toward fixation than without selection on markers. However, it is known that selection on the markers themselves is very efficient, with 2–4 markers/Morgan. And obviously, once they are fixed, markers become useless for selection. Hence what is really important is not whether markers will be fixed or not, but how much of the genome outside the markers will be fixed for recipient type by the time the markers are fixed. Based on these considerations, we propose to define optimal marker positions as the positions that maximize the genomewide proportion of loci that are fixed for homozygous recipient type once the markers are fixed for homozygous recipient type (i.e., selection on markers has been successful). This is evaluated from the expected probability that any locus on the genome is of recipient type, given that all markers are of recipient type.

Methods

Throughout the article, we will assume that recombination takes place without interference and will use Haldane's mapping function to compute genetic distances from recombination rates. Since only one chromosome of each pair is segregating in backcrossing, the analytical derivations and numerical applications are related to the segregating chromosome throughout the article. The criterion computed (Π) is the expected proportion of recipient genome, given that all markers are of recipient type, which obviously can be addressed on a per chromosome basis as

$$\Pi = 100 \int_0^L \frac{1}{L}\, P(X \mid M)\, dx, \tag{1}$$

where L is the chromosome length and $P(X \mid M)$ is the probability that a locus X at position x on the chromosome is of homozygous recipient type given that all markers are of homozygous recipient type on this chromosome. The value of Π thus depends on the number and the positioning of markers, and on the backcross generation at which all markers are of homozygous recipient type (i.e., the last generation of background selection).

Figure 1. Positioning of m markers ($M_1, \ldots, M_m$) on a chromosome of size L. The parameter used to describe the positioning is the distance d between telomere $T_1$ and the first marker $M_1$. The other markers are equally spaced in $[M_1, M_m]$ as described in the text.

For a given number of markers m, the positioning of markers on a chromosome is described by a single parameter d (see Figure 1), the distance between the first telomere ($T_1$) and the first marker ($M_1$); d is also the distance between the last marker ($M_m$) and the second telomere ($T_2$). For m > 2, the other markers ($M_2$ to $M_{m-1}$) are equally spaced in the segment $[M_1, M_m]$, as was also done by Visscher (1996), who used the same parameter d. Hence the chromosome is composed of two segments delimited by a telomere and a marker (herein called TM segments), of size d, and (m − 1) segments delimited by two successive markers (herein called MM segments), of size (L − 2d)/(m − 1). The closed form of Π for two markers at generation BC1 can be obtained by analytical derivations (see appendix):

$$\Pi = 100 \cdot \frac{1}{2}\left(1 + \frac{1}{L}\left(2 r_{TM} + \frac{r_{M_1M_2}}{1 - r_{M_1M_2}}\right)\right), \tag{2}$$

where $r_{TM}$ is the recombination rate between $T_1$ and $M_1$ (and between $T_2$ and $M_2$), and $r_{M_1M_2}$ is the recombination rate between $M_1$ and $M_2$: $r_{TM} = \tfrac{1}{2}(1 - e^{-2d})$ and $r_{M_1M_2} = \tfrac{1}{2}(1 - e^{-2(L-2d)})$. To find optimal marker positions, Π must then be maximized for d ∈ [0, L/2].

Computing Π for more markers or for more advanced backcross generations is hardly amenable analytically. We then computed $P(X \mid M)$ using MDM, a program designed for the numerical computation of expected genotype frequencies at multiple loci (Servin et al. 2002). From the results of MDM, Π was approximated by summing $P(X \mid M)$ for discrete values of x, equally spaced along the chromosome, with a step of 0.1 cM. A smaller step was tried but did not produce significantly more accurate results. We derived Π on the chromosome for different numbers m of markers and for different positionings of the markers.

The closer a locus at position x is to a marker, the higher is $P(X \mid M)$. When d increases, the size of the MM segments decreases and $P(X \mid M)$ at any locus on MM segments increases. Conversely, when d increases, the size of TM segments increases, and $P(X \mid M)$ decreases at any locus on TM segments. As Π is a linear combination of these probabilities, it presents a maximum (noted Π*) for an optimal value of d (noted d*), giving the optimal positioning of the markers. Figure 2 illustrates variations in the genomic composition as a function of d for a chromosome of 100 cM controlled by two markers of fully recipient genotype at generation BC3. As explained above, the proportion of recipient genome on MM segments (dotted lines) increases with d and the proportion of recipient genome on TM segments (scattered lines) decreases as d increases. Finally, Π presents a maximum of coordinates (d*, Π*) indicated by the dot in Figure 2. Qualitatively, similar results are obtained with any number of markers and at any backcross generation.

Figure 2. Estimated proportion of recipient genome (Π) on MM segments (dotted line), TM segments (scattered line), and on the whole chromosome (solid line) as a function of the positioning of two markers (d) on a chromosome of 100 cM for backcross generation BC3. The dot indicates the maximum of Π of coordinates (d*, Π*).
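As an illustration of the maximization described in the Methods, the following minimal sketch evaluates the closed form of Π for two markers at BC1 and searches d over [0, L/2] on a grid. It is not the authors' code: the function names (`haldane_r`, `pi_two_markers_bc1`, `optimal_d`) and the 0.1 cM grid step are assumptions made for this example, and distances are expressed in Morgans.

```python
import math


def haldane_r(distance):
    """Recombination rate for a map distance in Morgans (Haldane, no interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance))


def pi_two_markers_bc1(d, length=1.0):
    """Closed form of Pi (in %) for two markers at generation BC1.

    d is the telomere-to-marker distance and length the chromosome size,
    both in Morgans.
    """
    r_tm = haldane_r(d)                 # telomere-marker recombination rate
    r_mm = haldane_r(length - 2.0 * d)  # marker-marker recombination rate
    return 100.0 * 0.5 * (1.0 + (2.0 * r_tm + r_mm / (1.0 - r_mm)) / length)


def optimal_d(length=1.0, step=0.001):
    """Grid search (hypothetical helper) for the d in [0, L/2] maximizing Pi."""
    n = int(round(0.5 * length / step))
    grid = [i * step for i in range(n + 1)]
    d_star = max(grid, key=lambda d: pi_two_markers_bc1(d, length))
    return d_star, pi_two_markers_bc1(d_star, length)


d_star, pi_star = optimal_d()
print(f"d* = {100 * d_star:.1f} cM, Pi* = {pi_star:.1f} %")
```

For a chromosome of 100 cM the search lands close to the two-marker BC1 entry of Table 1 (d* of about 18.6 cM, Π* of about 93.4%). A grid search is used only for transparency; any one-dimensional optimizer would serve.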

Results and Discussion

Optimal Positioning

Table 1 shows the optimal positioning (d*) of two to four markers on a 100 cM chromosome in backcross generations BC1–BC3, as well as the corresponding Π*. The theoretical proportion of recipient genome without selection on markers is also recalled as a comparison (π). Finally, Table 1 recalls the optimal positioning of Visscher (1996), expressed as corresponding d values.

Table 1. The optimal positioning (d*) of m markers on a chromosome of 100 cM and the corresponding proportion (Π*) of recipient genome for different backcross generations (BC). The theoretical proportion of recipient genome on the chromosome when no selection is performed (π) and the optimal marker positioning (dV96) from Visscher (1996) are recalled.

m   BC   π (%)    Π* (%)   d* (cM)   dV96 (cM)   d*− (cM)   d*+ (cM)
2   1    75       93.4     18.6      27.5        10.4       27.0
2   2    87.5     95.2     21.4      28.0        10.0       32.8
2   3    93.75    96.9     22.9      28.6        7.1        38.6
3   1    75       97.1     8.4       18.3        0          17.9
3   2    87.5     97.6     11.0      18.8        0          23.5
3   3    93.75    98.3     12.6      19.2        0          29.7
4   1    75       98.5     4.5       13.6        0          14.4
4   2    87.5     98.6     6.5       13.9        0          19.5
4   3    93.75    98.9     7.8       14.2        0          25.2

It is seen from Table 1 that optimal d* values slightly increase with the backcross generations. This can be interpreted as follows. In BC1, the optimal length of MM segments, (L − 2d*)/(m − 1), is much larger than the length of TM segments (d*), because segments flanked by two selected markers are better controlled than segments flanked by only one marker. But, as meioses accumulate in advanced backcross generations, the apparent recombination rate between markers increases, and MM segments tend to be not better controlled than TM segments. The optimal size of MM segments relative to TM segments then needs to be reduced compared to its value in BC1.

The variation of d* is more important between generation BC1 and BC2 than between more advanced generations (e.g., BC2 and BC3) as seen in Table 1. Indeed, as only one generation of recombination has taken place in BC1, it is very likely that the MM segments are introgressed as a whole and are fully recipient, because the probability that double recombination events occurred between the markers is very low, except for very large MM segments. In later backcross generations, loss of control on MM segments is due not only to (rare) double recombinations between the markers, but also to (more frequent) single recombinations between the markers occurring twice in different generations. Thus the apparent recombination rate between markers increases faster between generation BC1 and BC2 than between two other backcross generations.

Suboptimal Positioning

Even with dense genetic maps, it can be hard to find markers exactly positioned at optimal positions d*. In this case, it is of interest to know the impact on genome content of using markers not exactly placed in the optimal positions described above (suboptimal positioning of markers). The last two columns of Table 1 show the positions (d*− and d*+) defining an interval in which Π* − 1% ≤ Π ≤ Π*.

Using more markers on a chromosome leads to better control of the return to the recipient genome because the regions controlled by the markers overlap. Thus better control of the genomic background can be achieved either by using more markers, that can be suboptimally placed, or by using fewer markers, optimally placed. For example, in generation BC2, using four suboptimally placed markers leads to an expected Π of 97.6%, which can be obtained with three well-placed markers (Π* = 97.6%).

For a given number of markers, the impact of suboptimal positioning of markers is less important when the backcross generation is more advanced. Indeed, even when no selection on markers is performed, the recurrent genome content increases, due to backcrossing. Thus for a given number of markers, the same value of Π can be reached either by optimally placing markers and performing fewer backcross generations or by suboptimally placing the markers and performing more backcross generations. For example, a chromosome of fully recipient genotype at three markers will present an expected proportion of recipient genome of Π = Π* = 97.6% at generation BC2 if markers are optimally placed. If markers are suboptimally placed, the same return to the recipient genome will be obtained at generation BC3 (Π ≤ 97.3%).

Optimization Criterion

We found that optimal positions of two markers on a chromosome of 100 cM are about 20 cM from the telomeres (from 18.6 cM in BC1 to 22.9 cM in BC3 as recalled from Table 1). These results slightly differ from those in Visscher (1996), where the optimal marker positions are around 28 cM from the telomeres in the same conditions. Generally the positioning described in Visscher (1996) is farther from the telomeres than ours. The main difference is that optimal d* values here are given conditional on the success of selection, whereas the values given by Visscher (1996) are obtained assuming no selection on markers. This could explain the difference between our respective results. Conversely, our results for two markers fit well those of Hospital et al. (1992), obtained by simulations that take fully into account selection on markers. In fact, they found that the optimal positioning of two markers on a 100 cM chromosome was roughly at 20 cM from the telomeres. This argues for a better relevance of our optimization criterion Π to predict marker positions that maximize the response to selection (i.e., the return to recipient parent genome) compared to the one used by Visscher (1996), because Π takes selection into account.

The optimal positioning given in Visscher (1996) is based on the linear prediction of the proportion of recipient genome in a population composed of individuals presenting every possible genotype at markers. However, the relationship between the proportion of recipient genome and the possible genotypes at markers is linear only in BC1. Indeed, using such a linear predictor in BC1 leads to results very close to the ones obtained using our estimate Π. For more advanced backcross generations, the relationship is no longer linear, and a linear predictor is not a good estimate of the proportion of recipient genome.

Efficiency of Background Selection

The proportion of recipient genome (Π*) obtained for the optimal positioning is high compared to the theoretical values when no selection on markers is performed (π), as shown in Table 1. For example, a noncarrier chromosome presenting three optimally placed markers that are of recipient type in BC3 will have 99.2% of recipient genome. Without selection on markers, the same return to the recipient genome would be obtained only in BC6. Thus when all markers are of recipient type, it is expected that most of the genome is of recipient type. This confirms previous studies by showing that few markers can efficiently control large chromosomal regions (Hospital et al. 1992; Visscher et al. 1996).

Although the criterion we used to infer optimal positioning is based on the success of selection on markers, our study does not allow us to predict the efficiency of selection on markers, but previous studies have shown that it is very efficient. For example, Hospital et al. (1992) considered background selection on two markers per chromosome for 10 noncarrier chromosomes of 100 cM. They showed that, selecting down to a proportion of 10% individuals at each generation, homozygous recipient genotypes at all markers can be obtained as early as BC3. In the case of fewer noncarrier chromosomes and/or higher selection intensity, background selection may succeed in only one or two generations.

Our method could be extended to background selection on carrier chromosomes, but the optimal positioning will then depend on the position of the target gene (or of markers controlling it) and on the positions of markers used to reduce the linkage drag around the target gene.
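Two of the quantities discussed in the results can be reproduced for the simplest case with a short sketch. The first function computes the suboptimal interval [d*−, d*+] for two markers at BC1 from the closed form of Π, the only case with an analytical expression; the second computes the expected proportion of recipient genome without selection, taken here as 1 − (1/2)^(t+1) at generation BCt, the standard backcross expectation that matches the π column of Table 1. All function names are hypothetical and the grid step is an assumption.

```python
import math


def haldane_r(distance):
    """Recombination rate for a map distance in Morgans (Haldane)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance))


def pi_two_markers_bc1(d, length=1.0):
    """Closed form of Pi (in %) for two markers at BC1, distances in Morgans."""
    r_tm = haldane_r(d)
    r_mm = haldane_r(length - 2.0 * d)
    return 100.0 * 0.5 * (1.0 + (2.0 * r_tm + r_mm / (1.0 - r_mm)) / length)


def suboptimal_interval(loss=1.0, length=1.0, step=0.001):
    """Range of d (Morgans) for which Pi stays within `loss` points of Pi*."""
    n = int(round(0.5 * length / step))
    grid = [i * step for i in range(n + 1)]
    values = [pi_two_markers_bc1(d, length) for d in grid]
    pi_star = max(values)
    inside = [d for d, v in zip(grid, values) if v >= pi_star - loss]
    return min(inside), max(inside)


def pi_no_selection(t):
    """Expected % of recipient genome at generation BCt without marker selection."""
    return 100.0 * (1.0 - 0.5 ** (t + 1))


d_minus, d_plus = suboptimal_interval()
print(f"BC1, m = 2: d*- = {100 * d_minus:.1f} cM, d*+ = {100 * d_plus:.1f} cM")
for t in (1, 2, 3, 6):
    print(f"BC{t}: pi = {pi_no_selection(t):.2f} %")
```

The interval obtained should sit close to the 10.4 to 27.0 cM range reported in Table 1 for two markers at BC1, and the unselected expectation first exceeds 99.2% at BC6, in line with the comparison made in the text.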

Appendix: Analytical Derivation of Π for Two Markers in Generation BC1

We consider a chromosome controlled by two markers ($M_1$ and $M_2$) positioned as explained in Figure 1. We denote $r_{M_1M_2}$ the recombination rate between $M_1$ and $M_2$, and $r_{TM}$ the recombination rate between $T_1$ and $M_1$. As the distance between $T_2$ and $M_2$ is also d, $r_{TM}$ is also the recombination rate between $T_2$ and $M_2$. We assume that recombination rates are related to genetic distances by Haldane's mapping function, and thus $r_{M_1M_2} = \tfrac{1}{2}(1 - e^{-2(L-2d)})$ and $r_{TM} = \tfrac{1}{2}(1 - e^{-2d})$. We also consider an unmarked locus, noted X, placed at position x on the chromosome. As recalled from Equation (1) in the method section,

$$\Pi = 100 \int_0^L \frac{1}{L}\, P(X \mid M)\, dx = 100 \int_0^L \frac{1}{L}\, \frac{P(X \cap M)}{P(M)}\, dx, \tag{A.1}$$

where $P(X \cap M)$ is the probability to have the three loci X, $M_1$, and $M_2$ of homozygous recipient genotype, and $P(M)$ is the probability to have both markers $M_1$ and $M_2$ of homozygous recipient genotype. As markers are placed symmetrically to the center of the chromosome, and as only $P(X \cap M)$ is a function of x, Equation (A.1) can be rewritten as

$$\Pi = 100\, \alpha(d, L) \int_0^{L/2} P(X \cap M)\, dx, \tag{A.2}$$

where $\alpha(d, L) = 2 / (L\, P(M))$ and $P(M) = \tfrac{1}{2}(1 - r_{M_1M_2})$. $P(X \cap M)$ has to be divided into two parts to compute Equation (A.2), depending on the relative positions of X and $M_1$:

$$P(X \cap M) = \begin{cases} P_{TM}(x, d, L) & \text{when } x \in [0, d] \\ P_{MM}(x, d, L) & \text{when } x \in [d, L/2]. \end{cases}$$

Let $r_1$ denote the recombination rate between X and $M_1$, and $r_2$ the recombination rate between X and $M_2$. Using Haldane's mapping function we have

$$r_1 = \tfrac{1}{2}\left(1 - e^{-2|d - x|}\right), \qquad r_2 = \tfrac{1}{2}\left(1 - e^{-2(L - d - x)}\right).$$

Computing $\int_0^d P_{TM}(x, d, L)\, dx$

As X is on the TM segment, $P_{TM}(x, d, L) = \tfrac{1}{2}(1 - r_2 - r_{M_1M_2} r_1)$. In this case, $r_1 = \tfrac{1}{2}(1 - e^{-2(d - x)})$. Developing $r_1$ and $r_2$ as functions of x, we obtain

$$P_{TM}(x, d, L) = \frac{1}{2}\left[\frac{1}{2}(1 - r_{M_1M_2}) + \frac{1}{4}\left(e^{-2d} + e^{-2(L-d)}\right) e^{2x}\right]. \tag{A.3}$$

Integrating Equation (A.3) gives

$$\int_0^d P_{TM}(x, d, L)\, dx = \frac{1}{4}(1 - r_{M_1M_2})(d + r_{TM}). \tag{A.4}$$

Computing $\int_d^{L/2} P_{MM}(x, d, L)\, dx$

As X is on the MM segment, $P_{MM}(x, d, L) = \tfrac{1}{2}(1 - r_{M_1M_2} - r_1 r_2)$. In this case, $r_1 = \tfrac{1}{2}(1 - e^{-2(x - d)})$. Developing $r_1$ and $r_2$ as functions of x, we obtain

$$P_{MM}(x, d, L) = \frac{1}{2}\left[\frac{1}{2}(1 - r_{M_1M_2}) + \frac{1}{4} e^{2d} e^{-2x} + \frac{1}{4} e^{-2(L-d)} e^{2x}\right]. \tag{A.5}$$

Integrating Equation (A.5) gives

$$\int_d^{L/2} P_{MM}(x, d, L)\, dx = \frac{1}{2}\left[\frac{1}{4}(1 - r_{M_1M_2})(L - 2d) + \frac{1}{4} r_{M_1M_2}\right]. \tag{A.6}$$

Obtaining Π

Finally, from Equation (A.2),

$$\Pi = 100\, \alpha(d, L) \left(\int_0^d P_{TM}(x, d, L)\, dx + \int_d^{L/2} P_{MM}(x, d, L)\, dx\right), \tag{A.7}$$

$$\Pi = 100 \cdot \frac{1}{2}\left(1 + \frac{1}{L}\left(2 r_{TM} + \frac{r_{M_1M_2}}{1 - r_{M_1M_2}}\right)\right). \tag{A.8}$$

From the Station de Génétique Végétale, INRA/UPS/INAPG, Ferme du Moulon, 91190 Gif sur Yvette, France. The authors wish to thank P. M. Visscher and one anonymous referee for their helpful comments. Address correspondence to Bertrand Servin at the address above or e-mail: [email protected].

© 2002 The American Genetic Association

References

Hill WG, 1993. Variation in genetic composition in backcrossing programs. J Hered 84:212–213.

Hospital F, Chevalet C, and Mulsant P, 1992. Using markers in gene introgression breeding programs. Genetics 132:1199–1210.

Servin B, Dillmann C, Decoux G, and Hospital F, 2002. MDM: a program to compute fully informative genotype frequencies in complex breeding schemes. J Hered 93:227–228.

Visscher PM, 1996. Proportion of the variation in genomic composition in backcrossing programs explained by genetic markers. J Hered 87:136–138.

Visscher PM, Haley CS, and Thompson R, 1996. Marker assisted introgression in backcross breeding programs. Genetics 144:1923–1932.

Received April 23, 2001
Accepted December 31, 2001

Corresponding Editor: Leif Andersson


Using markers to reduce the variation in the genomic composition in marker-assisted backcrossing

Bertrand SERVIN¹*

¹ UMR de Génétique Végétale INRA / CNRS / UPS / INAP-G, Ferme du Moulon, 91190 Gif-sur-Yvette, France

Submitted to: Genetical Research
Running title: Reducing variance of genomic composition in backcrossing

Proofs to be sent to: Bertrand Servin, UMR de Génétique Végétale INRA / CNRS / UPS / INAP-G, Ferme du Moulon, 91190 Gif-sur-Yvette, France

* Corresponding author. Telephone: +33(0)169332349. e-mail: [email protected]

Summary

Marker-assisted introgression or backcrossing is a widely used method to improve commercial lines or to study the behaviour of particular genes in a homogeneous genetic background. In this scope, the return to the recipient parent genome in the genetic background of individuals is a major objective of backcrossing. Selection on markers has been shown to be very useful to accelerate the rate of return to the recipient genome in backcrossing. In this study we show how much information markers give on the true genetic composition of individuals by deriving the variance and estimating the distribution of the genetic composition of individuals sharing a known genotype at markers. These derivations allow us to infer the number of individuals carrying an ideal genotype at markers that must be produced to fulfill background selection objectives.

1. Introduction

Backcrossing is a widely used method for the improvement of varieties. The use of molecular markers to increase selection efficiency in so-called marker-assisted backcrossing (MAB) has been studied for a long time. Markers can be used to (i) assess the presence of donor type alleles at target genes (Melchinger 1990; Hospital and Charcosset 1997), (ii) reduce the length of the intact donor segment retained around the target gene (Frisch and Melchinger 2001; Hospital 2001) and (iii) accelerate the return to a fully recipient genotype on non-carrier chromosomes (e.g. Hospital et al. 1992; Visscher et al. 1996).

Basically, the principle of MAB is to define, prior to selection, an ideal genotype at markers (ideotype) and then to perform generations of backcrossing while selecting individuals on their genotype at markers to obtain the ideotype as quickly and cheaply as possible. In this scope, markers are useful because they allow us to estimate the genomic composition (Π) of individuals carrying known genotypes at markers. Hence, a key point in marker-assisted backcrossing is to be able to define the optimal positions and number of markers to use so that the genomic composition of the individuals obtained by selection on markers only meets the expected return to recipient genotype.

In MAB, the return to the recipient parent genome is a major objective of selection. The required extent of this return can depend on the difference between the agronomical performances of the starting lines (donor and recipient parents). For example, if the donor line is of poor agronomical performance and only used to bring a particular aptitude, such as resistance to diseases, the selection objective on Π is high. On the other hand, if the donor parent has a good agronomical performance, the selection objective on Π might be lower.

In order to ensure a given return to the recipient genome on the genetic background, markers to be used for selection are chosen and then individuals which carry homozygous recipient genotype at these markers are selected. This selection is particularly efficient on chromosomes that do not carry target genes (non-carrier chromosomes). The optimal number and positions of markers on non-carrier chromosomes have been studied first by Visscher (1996) and then by Servin and Hospital (2002).

Visscher (1996) developed a method to compute the part of the variation in the genomic composition of individuals explained by markers. His method allows evaluation of how much information markers give regarding the genomic composition (Π) of individuals. However, this method assumes no selection on markers, as the effect of selection is indeed hard to cope with in such studies.

Servin and Hospital (2002) used a different optimization criterion to compute the optimal positions of markers on non-carrier chromosomes. They assumed selection on markers is very efficient so that individuals carrying a fully recipient genotype at markers are easy to obtain. Hence, from the known genotype at markers, the principle of their method is to compute the average genomic composition (Π̂) of individuals and to find the positioning of markers corresponding to the maximal value of Π̂. This allows the effect of selection to be taken partially into account, as Π̂ is computed conditionally on the success of selection. However, their method does not take into account the variance of Π, which would give information on the probability that the selected individuals actually have a sufficient return to recipient genome Π.

Here, we derive a method to compute the variance of the genomic composition of individuals sharing a known genotype at markers. We denote the standard deviation of Π by SDΠ. We then derive a method to estimate the distribution function of Π (fΠ) and the cumulative distribution of Π in individuals carrying the ideotype at markers. This allows us to estimate the number of individuals that must be produced to fulfill background selection objectives.

2. Methods

(i) Mean and Variance of Π

The average genomic composition when no background selection is performed (π) is well known: the proportion of donor genome is halved at each generation of backcrossing. A method to compute the variance in π has been described by Hill (1993). Here we consider the genomic composition of an individual given its genotype at markers (Π), defined as the proportion of recipient genome in its genetic background. We use a method similar to Hill's and follow his (1993) notation as closely as possible. Let t denote the generation of backcrossing, t = 1 being the first backcross. Let M be a known genotype at markers. Let $Z^{(t)}_{i|M}$ denote whether the ith locus from an individual of genotype M at markers at generation t originated from the recurrent parent ($Z^{(t)}_{i|M} = 1$) or the donor parent ($Z^{(t)}_{i|M} = 0$).

The mean of $Z^{(t)}_{i|M}$ is:

$$E\left(Z^{(t)}_{i|M}\right) = P\left(Z^{(t)}_{i|M} = 1\right) \tag{1}$$

where $P(Z^{(t)}_{i|M} = 1)$ is the probability that locus i originates from the recurrent parent, given information on markers.

Let $\bar{Z}^{(t)}_M$ be the genomic contribution to the total genome from the recurrent parent in an individual of genotype M at markers (in proportion of the genome size). Considering k loci on the genome, we have:

$$\bar{Z}^{(t)}_M = \frac{1}{k} \sum_{i=1}^{k} Z^{(t)}_{i|M} = \Pi \tag{2}$$

We compute the expected proportion of recipient genome of a genotype M at markers as:

$$E\left(\bar{Z}^{(t)}_M\right) = E\left(\frac{1}{k} \sum_{i=1}^{k} Z^{(t)}_{i|M}\right) = \frac{1}{k} \sum_{i=1}^{k} P\left(Z^{(t)}_{i|M} = 1\right) = \hat{\Pi} \tag{3}$$

Assuming a continuous distribution of loci on a chromosome of length L, with map positions $z_i$, equation (3) becomes:

$$E\left(\bar{Z}^{(t)}_M\right) = \frac{1}{L} \int_0^L P\left(Z^{(t)}_{i|M} = 1\right) dz_i \tag{4}$$

When considering the ideotype at markers, this proportion is maximized when markers are located at the optimal positions described in Servin and Hospital (2002). The corresponding variance has not been calculated yet. This variance is:

$$\mathrm{Var}\left(\bar{Z}^{(t)}_M\right) = \mathrm{Var}\left(\frac{1}{k} \sum_{i=1}^{k} Z^{(t)}_{i|M}\right) = E_s\left[\mathrm{cov}\left(Z^{(t)}_{i|M}, Z^{(t)}_{j|M}\right)\right] = (SD_\Pi)^2 \tag{5}$$

where $E_s$ denotes a mean over all pairs of loci. Again assuming a continuous distribution of loci on a chromosome of size L, equation (5) becomes:

$$E_s\left[\mathrm{cov}\left(Z^{(t)}_{i|M}, Z^{(t)}_{j|M}\right)\right] = \frac{1}{L^2} \int_0^L \int_0^L \left[P\left(Z^{(t)}_{i|M} = 1, Z^{(t)}_{j|M} = 1\right) - P\left(Z^{(t)}_{i|M} = 1\right) P\left(Z^{(t)}_{j|M} = 1\right)\right] dz_i\, dz_j \tag{6}$$

The probabilities could be calculated using the derivations of Visscher and Thompson (1995). But, with as few as one marker on the chromosome, the closed form for the integration of the joint probability $P(Z^{(t)}_{i|M} = 1, Z^{(t)}_{j|M} = 1)$ is hard to derive. However, these probabilities can be numerically computed using the computer program MDM (Servin et al., 2002).
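To make the double integral of equation (6) concrete, the sketch below evaluates it with a midpoint rule in a deliberately simple case chosen for this example, not taken from the source: generation BC1 with no marker information at all. Under Haldane's mapping function the probabilities are then known in closed form, $P(Z_i = 1) = 1/2$ and $P(Z_i = 1, Z_j = 1) = \tfrac{1}{2}(1 - r_{ij})$, and the integral reduces to $(L - r_L) / (4L^2)$ with $r_L = \tfrac{1}{2}(1 - e^{-2L})$, which gives a reference value to compare against. For a real marker genotype M, the two probability functions would be replaced by values computed numerically (the text uses MDM for this). All names are hypothetical.

```python
import math


def haldane_r(distance):
    """Recombination rate for a map distance in Morgans (Haldane)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance))


def joint_recipient(zi, zj):
    """P(Z_i = 1, Z_j = 1) at BC1 without marker information (toy case)."""
    return 0.5 * (1.0 - haldane_r(abs(zi - zj)))


def marginal_recipient(z):
    """P(Z_i = 1) at BC1 without marker information (toy case)."""
    return 0.5


def variance_of_composition(length=1.0, n=400):
    """Midpoint-rule evaluation of the double integral in equation (6)."""
    h = length / n
    mids = [(i + 0.5) * h for i in range(n)]
    total = 0.0
    for zi in mids:
        for zj in mids:
            total += (joint_recipient(zi, zj)
                      - marginal_recipient(zi) * marginal_recipient(zj))
    return total * h * h / length ** 2


def variance_closed_form(length=1.0):
    """Closed form of the same integral for the toy case above."""
    return (length - haldane_r(length)) / (4.0 * length ** 2)


var_numeric = variance_of_composition()
var_closed = variance_closed_form()
print(f"numeric = {var_numeric:.5f}, closed form = {var_closed:.5f}")
print(f"SD = {math.sqrt(var_numeric):.3f}")
```

The agreement between the two values only checks the integration scheme; it says nothing about the variances conditional on markers, which are the quantities of interest in the text.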

(ii) Distribution of Π

Together with the mean and variance of Π we are interested in the distribution of Π values in individuals presenting the ideotype at markers. The distribution function of Π, fΠ, is not known. However, we can perform a Monte Carlo integration of fΠ by drawing values from the distribution of Π with computer simulations as follows. We simulated chromosomes of 100 centiMorgans carrying markers and many evenly spread loci, which allow their true genomic composition (Π) to be assessed. After a given number of generations, we stopped the simulations and kept only chromosomes carrying a fully recipient genotype at markers. This allows us to estimate fΠ and the cumulative distribution function of Π (ΦΠ), that is the probability:

$$\Phi_\Pi(\alpha) = P(\Pi > \alpha) \quad \text{for } 0 \le \alpha \le 1 \tag{7}$$

Results and Discussion

Using the MDM program, we computed the variance in the genomic composition of individuals presenting a fully recipient genotype at markers. Markers were placed as described in Servin and Hospital (2002) on a chromosome of 100 centiMorgans.

Table 1 around here

Table 1 shows Π̂ and SDΠ of individuals presenting a fully recipient genotype at markers on a non-carrier chromosome of 100 centiMorgans. Figures are shown for two, three and four markers per non-carrier chromosome (m) and for two, three and four backcross generations (BC). The tabulated values are successively:
• The mean proportion of recipient genome in non-selected populations (π) and the corresponding standard deviation (SDπ), computed using the formula from Hill (1993).
• The optimal positioning of markers (d*), computed as in Servin and Hospital (2002).
• The mean proportion (Π) of recipient genome on the chromosome given that it carries a fully recipient genotype at markers, and the corresponding standard deviation (SDΠ).
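As a quick check on the π column of Table 1: without selection, a locus is still heterozygous with probability (1/2)^t after t backcross generations, so the expected proportion of recipient alleles is 1 − (1/2)^(t+1). The short sketch below reproduces the tabulated 87.5%, 93.75% and 96.9% (the last being 96.875% rounded). The function name is illustrative only, and the standard deviation formula of Hill (1993) is not reproduced here.

```python
def expected_recipient_proportion(t):
    """Expected proportion of recipient alleles at backcross generation t, without selection."""
    return 1.0 - 0.5 ** (t + 1)

if __name__ == "__main__":
    for t in (2, 3, 4):
        print(f"BC{t}: {100 * expected_recipient_proportion(t):.3f} %")
```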

Table 1 shows that the standard deviation of the true genomic composition of genotypes that are fully recipient at markers decreases as the number of generations increases. This is expected: as more backcross generations are performed, individuals tend to become equivalent, with a proportion of recipient genome of about 100%. This standard deviation also decreases as the number of markers increases, showing that using more markers leads to a better estimation of the true genomic composition of individuals, which is also expected.

If we use few markers per non-carrier chromosome, selection on markers will be very efficient. However, the values of SDΠ in table 1 show that the actual rate of return Π of an individual obtained at the end of the selection process might be considerably lower than the average rate of return to recipient genome Π̂. In order to cope with this problem, it is possible to use more markers per non-carrier chromosome, but this will concomitantly increase the cost of the backcross program. In order to choose precisely the number of markers that must be genotyped on non-carrier chromosomes, we used an approach based on the estimation of the distribution of Π values in individuals carrying the ideotype at markers.

At the end of a MAB program, a population of individuals sharing the same ideal genotype at markers is obtained. To complete the program, it is possible to select the best individual inside this population. As individuals share the same genotype at markers, the selection can be done either by (i) genotyping more markers on the genetic background to have a more precise estimate of Π for each individual or by (ii) evaluating the agronomic performance of each individual. Note that, in the first case, as the return to the recipient genome of all individuals in the population is already quite high, the number of additional markers to genotype before getting sufficient discrimination power inside the population can be very large.

The cost of the selection process at this last step depends on the number of individuals that must be evaluated. In order to limit that cost, we can estimate the minimal number of individuals (NI*) carrying a fully recipient genotype at markers that must be obtained so that at least one of them has a sufficient proportion Π in its genetic background. NI* is computed from the probability ΦΠ that an individual carrying the ideotype at markers has a Π above a given value. If we could assume that the distribution of Π is gaussian, we could compute NI* knowing Π̂ and SDΠ. Figure 1 shows the distribution of Π drawn with simulations performed as presented in the methods section. We can see that, unfortunately, the distribution of Π is strongly L-shaped (i.e. far from gaussian), particularly because we only retain the individuals that carry a fully recipient genotype at markers: typically, as seen on figure 1, only a few individuals have a low Π while most of them are nearly fully recipient even outside the markers.

From fΠ, we can infer the cumulative distribution function of Π (ΦΠ). An example of ΦΠ is given in figure 2, for a single non-carrier chromosome of 100 centiMorgans with three markers homozygous recipient at generation BC2. We similarly estimated ΦΠ with two to four markers per non-carrier chromosome, obtained at backcross generations BC2 to BC4, and for genomes composed of one, five or ten non-carrier chromosomes. Table 2 gives the corresponding NI*, computed with a probability of success of 0.99, for selection objectives of 95%, 97% or 99% of recipient genome.
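The simulation procedure and the step from ΦΠ to NI* can be sketched as follows for a single non-carrier chromosome. This is a minimal illustration under stated assumptions, not the program used for the paper: crossovers follow Haldane's no-interference model, there is one locus per centiMorgan, the marker positions (20, 50 and 80 cM) are arbitrary rather than the optimal d*, Π is taken as one minus half the donor fraction of the backcrossed chromosome (so that the unselected mean at BC2 is 87.5%), and NI* is taken as the smallest N such that 1 − (1 − Φ)^N reaches the success probability, which assumes independent individuals. All function names are hypothetical.

```python
import math
import random

RATE_PER_CM = 0.01  # assumption: Haldane model, one crossover per Morgan on average

def meiosis(mosaic, positions, rng, length_cm=100.0):
    """Gamete from the pair (mosaic chromosome, pure recipient chromosome).
    Alleles: 1 = donor, 0 = recipient."""
    crossovers, x = [], rng.expovariate(RATE_PER_CM)
    while x < length_cm:
        crossovers.append(x)
        x += rng.expovariate(RATE_PER_CM)
    on_mosaic = rng.random() < 0.5       # strand copied at the start of the chromosome
    gamete, k = [], 0
    for pos, allele in zip(positions, mosaic):
        while k < len(crossovers) and crossovers[k] < pos:
            on_mosaic = not on_mosaic    # switch strand at each crossover
            k += 1
        gamete.append(allele if on_mosaic else 0)
    return gamete

def simulate_pi(generations, marker_positions, replicates, rng):
    """Pi of chromosomes that are recipient at every marker after `generations` backcrosses."""
    positions = list(range(101))         # one locus per cM, so a position is also an index
    kept = []
    for _ in range(replicates):
        chrom = [1] * len(positions)     # donor chromosome of the F1
        for _ in range(generations):
            chrom = meiosis(chrom, positions, rng)
        if all(chrom[m] == 0 for m in marker_positions):
            kept.append(1.0 - 0.5 * sum(chrom) / len(chrom))
    return kept

def phi(pi_values, alpha):
    """Empirical P(Pi > alpha) among the retained chromosomes."""
    return sum(v > alpha for v in pi_values) / len(pi_values)

def individuals_needed(p_single, success=0.99):
    """Smallest N with 1 - (1 - p_single)**N >= success (independence assumed)."""
    if p_single <= 0.0:
        raise ValueError("objective is never reached in the simulated sample")
    if p_single >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - success) / math.log(1.0 - p_single))

if __name__ == "__main__":
    rng = random.Random(0)
    selected = simulate_pi(2, [20, 50, 80], 5000, rng)
    p = phi(selected, 0.97)
    print(len(selected), p, individuals_needed(p))
```

Extending the sketch to genomes of several non-carrier chromosomes would require averaging Π over independently simulated chromosomes before applying the threshold.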

Table 2 around here

Table 2 shows that NI* is high at generation BC2, except when using four markers per chromosome. For example, the NI* needed to achieve a selection objective of 97% at generation BC2, when considering a genome composed of ten non-carrier chromosomes, is 92. It is interesting to note that the corresponding average rate of return to the recipient parent genome is 97.5%. This shows that Π̂ is not always a sufficient criterion to choose the number of markers to use for background selection, because SDΠ is large. NI* is low at generation BC2 when using four markers per non-carrier chromosome. However, obtaining NI* individuals with fully recipient genotypes at four markers per non-carrier chromosome as early as generation BC2 is difficult and might require huge population sizes.

For the more advanced backcross generations BC3 and BC4, the results presented in table 2 show that if the number of non-carrier chromosomes is one or five, NI* is close to ten, whatever the selection objective. Hence, when the number of non-carrier chromosomes is low, the whole genetic background can be controlled successfully with only two markers per Morgan. However, when considering ten non-carrier chromosomes, using more than two markers is mandatory to obtain a return to the recipient parent genome above 97% while limiting the cost of the last selection step. Using only two markers per non-carrier chromosome is thus only valid if the selection objective on Π and the number of non-carrier chromosomes are low.

The choice between using three or four markers per non-carrier chromosome for a large genome (ten non-carrier chromosomes) is not obvious. NI* is larger when using only three markers per non-carrier chromosome, but these individuals are easier to obtain by selection on markers than when using four markers. Hence, using three markers will limit the genotyping cost at early generations of backcross but will imply a high cost to select the best individual at the end of the backcross program. On the other hand, using four markers per non-carrier chromosome will increase the genotyping cost at all generations, but will require less expense to identify an individual that meets the selection objective at the end of the MAB program.

Acknowledgments
The author wishes to thank Steve Openshaw and Frédéric Hospital for useful comments and help with the manuscript.

References
Frisch, M., and A. E. Melchinger, 2001 The length of the intact donor chromosome segment around a target gene in marker-assisted backcrossing. Genetics 157: 1343–1356.
Hill, W. G., 1993 Variation in genetic composition in backcrossing programs. J. Hered. 84: 212–213.
Hospital, F., C. Chevalet and P. Mulsant, 1992 Using markers in gene introgression breeding programs. Genetics 132: 1199–1210.
Hospital, F., and A. Charcosset, 1997 Marker-assisted introgression of quantitative trait loci. Genetics 147: 1469–1485.
Hospital, F., 2001 Size of donor chromosome segments around introgressed loci and reduction of linkage drag in marker-assisted backcross programs. Genetics 158: 1363–1379.
Melchinger, A. E., 1990 Use of molecular markers in breeding for oligogenic disease resistance. Plant Breeding 104: 1–19.
Servin, B., C. Dillmann, G. Decoux and F. Hospital, 2002 MDM: a program to compute fully informative genotype frequencies in complex breeding schemes. J. Hered. 93: 227–228.
Servin, B., and F. Hospital, 2002 Optimal positioning of markers to control genetic background in marker-assisted backcrossing. J. Hered. 93: 214–216.
Stam, P., 2003 Marker-assisted introgression: speed at any cost? Proceedings of the Eucarpia Leafy Vegetables Section Meeting: 117–123. Eds. Th. J. L. van Hintum, A. Lebeda, D. Pink, J. W. Schut.
Visscher, P. M., and R. Thompson, 1995 Haplotype frequencies of linked loci in backcross population derived from inbred lines. Heredity 75: 644–649.
Visscher, P. M., C. S. Haley and R. Thompson, 1996 Marker assisted introgression in backcross breeding programs. Genetics 144: 1923–1932.
Visscher, P. M., 1996 Proportion of the variation in genomic composition in backcrossing programs explained by genetic markers. J. Hered. 87: 136–138.

m    BC    π (%)    SDπ (Hill, 1993)    d* (cM)    Π (%)    SDΠ (%)
2    2     87.5     15.37               21.4       97.5     4.8
2    3     93.75    11.02               22.9       98.4     3.8
2    4     96.9     7.57                24.0       99.0     2.9
3    2     87.5     15.37               11.0       98.8     3.0
3    3     93.75    11.02               12.6       99.2     2.5
3    4     96.9     7.57                14.0       99.4     2.0
4    2     87.5     15.37               6.5        99.3     2.1
4    3     93.75    11.02               7.8        99.5     1.8
4    4     96.9     7.57                9.0        99.6     1.4

Table 1: Expected proportion of recipient genome (Π) on a chromosome of 100 cM and its corresponding standard deviation (SDΠ) for a genotype recipient at all m markers, for different backcross generations (BC). The theoretical proportion of recipient genome on the chromosome when no selection is performed (π), its corresponding standard deviation (SDπ) and the optimal positioning of markers (d*, from Servin and Hospital 2002) are given for reference.

                     m = 2                m = 3                m = 4
Nc    Objective      BC2    BC3    BC4    BC2    BC3    BC4    BC2    BC3    BC4
1     Π ≥ 95%        (values not legible in the source)
1     Π ≥ 97%        4      3      3      3      3      2      2      2      2
1     Π ≥ 99%        5      4      3      4      3      3      3      3      2
5     Π ≥ 95%        12     7      5      5      4      4      4      3      3
5     Π ≥ 97%        19     10     6      8      6      5      5      4      4
5     Π ≥ 99%        30     14     8      14     9      6      8      7      4
10    Π ≥ 95%        41     16     9      10     7      6      6      5      4
10    Π ≥ 97%        92     28     13     21     13     9      10     7      6
10    Π ≥ 99%        216    53     21     53     24     13     21     15     7

Table 2: Number of individuals to produce in order to have a probability of 0.99 of obtaining an individual with a true proportion of recipient genome (Π) above a given threshold (95%, 97% and 99%), as a function of the number of non-carrier chromosomes (Nc), the number of markers per non-carrier chromosome (m) and the backcross generation at which the genotype at markers is obtained (BC).

Figure Legends

Figure 1: The distribution (fΠ) of the recipient genome content (Π) in a population of individuals presenting homozygous recipient genotypes at 3 markers at generation BC2 on a non-carrier chromosome of 100 centiMorgans. Positions of markers are as described in Servin and Hospital (2002). Results are based on 65000 replicates.

Figure 2: The cumulative distribution (ΦΠ) of the recipient genome content (Π) in a population of individuals presenting homozygous recipient genotypes at 3 markers at generation BC2 on a non-carrier chromosome of 100 centiMorgans. Positions of markers are as described in Servin and Hospital (2002). Results are based on 65000 replicates.

Part Four
Optimization of gene pyramiding programs

Presentation
This part of my thesis consists of two articles, one of which has been submitted to the journal Genetics (Servin et al., submitted) while the other is in preparation (Servin et al., in prep.). These articles deal with the optimization of gene pyramiding programs, that is, with determining the best successions of crosses for cumulating genes carried by several parents into a single genotype (called the ideotype).

Our approach was first to define a representation model for gene pyramiding schemes. We considered that the population of parents containing the genes to be cumulated was made up of genotypes each carrying a single gene of interest in the homozygous state. The selection schemes that cumulate these genes into the ideotype are successions of crosses between pairs of individuals. The parents of the starting population are used only once during the pyramiding, that is, each gene is introduced into the scheme only once. Once a gene has been introduced into the scheme, it is retained through the successive crosses.

Having defined this general representation model for gene pyramiding schemes (which we also call pedigrees), we built an algorithm that generates all possible pedigrees leading to the ideotype. Once these pedigrees have been built, their efficiency can be evaluated on various criteria, such as the minimal population sizes needed at each step of the pedigree. Using this criterion, we showed that our representation model makes it possible to define gene pyramiding schemes that are cheaper and shorter than those using the previously existing method (Hospital et al., 2000).

All of these results are presented in the article Servin et al. (submitted). The value of this approach is that it provides a general framework for the study of gene pyramiding programs. It is a first step towards developing the theory on this subject, which remains largely unexplored today. This research question is therefore open to many perspectives.

Perspectives
It should first be noted that the approach presented in Servin et al. (submitted) relies on an enumeration of all possible pedigrees. However, the number of pedigrees increases very rapidly with the number of genes present in the starting population. This raises a problem for the computer implementation of the method (because of limited computer resources), and it is not possible to handle the cumulation of more than 9 genes simultaneously. It is therefore necessary to think about optimizations of the algorithms used, in order to widen their domain of application.

Moreover, the results obtained with this method are for the moment relatively descriptive. Using the algorithm presented in Servin et al. (submitted), we can determine the best possible pedigree, but we cannot, for now, determine it a priori. The study of particular cases should make it possible to define more precise rules for determining a priori why some pedigrees are better than others. It would for example be interesting to study how the distribution of the genes of interest over the genome (homogeneous distribution or genes grouped in distant clusters) influences the optimal succession of crosses to perform.

The article in preparation included in this part is an example of an attempt to define general optimization rules for a particular type of pedigree: cascading pedigrees. In cascading pedigrees, at each generation a cross is made between a genotype carrying the genes already cumulated and one of the parents of the starting population. In this type of pedigree, genes are therefore cumulated one by one. The article Servin et al. (in prep.) seeks a general rule for optimizing these pedigrees, in order to find the one that maximizes the probability of obtaining the ideotype. It appears that when the genes are evenly distributed over the chromosomes, the pedigrees that cumulate the genes from neighbour to neighbour over the generations are the best.

This article also shows how complex it is to define general optimization rules. This is particularly due to the fact that a pedigree must be considered as a whole: the cross made at one generation influences the value of the pedigree over all following generations. Defining general optimization rules is therefore long and relatively laborious, and will provide food for thought. The value of defining such rules lies on the one hand in understanding the mechanisms that influence the cost of a genotype building program, and on the other hand in the development of more efficient algorithms for determining the best pedigrees, able to handle more genes simultaneously.

The search for these rules and the optimization of the search for the best pedigree are therefore linked research questions that require future developments.

Towards a theory of marker-assisted gene pyramiding

Bertrand Servin*, Olivier C. Martin†, Marc Mézard† and Frédéric Hospital*

* UMR de Génétique Végétale INRA / CNRS / UPSud / INAP-G, Ferme du Moulon, 91190 Gif-sur-Yvette, France

† Laboratoire de Physique Théorique et Modèles Statistiques, bâtiment 100, Université Paris-Sud, 91405 Orsay cedex, France

Draft version: 18th October 2003

Submitted to Genetics

Running head Optimization of gene pyramiding

Keywords marker-assisted selection, gene pyramiding, genotype building, selection theory, pedigree design

Corresponding author
Frédéric Hospital
Station de Génétique Végétale, INRA/UPS/INA-PG
Ferme du Moulon
91190 GIF SUR YVETTE
France
Tel: (33)(1) 69 33 23 36
Fax: (33)(1) 69 33 23 40
E-mail: [email protected]


B. Servin et al., Towards a theory of marker assisted gene pyramiding

18th October 2003 – 01 : 24

Abstract
We investigate the best way to combine into a single genotype a series of target genes identified in different parents (gene pyramiding). Assuming individuals can be selected and mated according to their genotype, the best procedure corresponds to an optimal succession of crosses over several generations (pedigree). For each pedigree, we deduce the probability of success from the known recombination fractions between the target loci, as well as the number of individuals (population sizes) that should be genotyped over successive generations until the desired genotype is obtained. We provide an algorithm that generates and compares pedigrees based on the population sizes they require and on their total duration (in number of generations), in order to find the best gene pyramiding scheme. Examples are given for eight target genes, and compared to a reference genotype selection method with random mating. The best gene pyramiding method combines the eight targets in three fewer generations, with as many genotyped individuals as the reference method.

Introduction
Recently there have been advances in the mapping of genes involved in the variation of quantitative traits, through quantitative trait loci (QTL) mapping experiments and analysis of genomic data. Such studies on complex traits should lead to the identification of a great number of genetic factors responsible for the heritable variation of these traits. Furthermore, once these genetic factors are mapped, they can be controlled by molecular markers and the corresponding genotypes of individuals can be assessed easily. As a consequence, the identification of individuals carrying favorable alleles at these loci will provide genetic material for the development of new improved varieties.


Most theoretical work on the application of marker–QTL associations in selection has focused on using markers to estimate an individual's breeding value more reliably than when using its phenotype. In practice, a selection index is generally built based on both the marker score and the phenotypic value of individuals (e.g. Lande and Thompson 1990; Hospital et al. 1997; Moreau et al. 1998); individuals are then selected before being mated at random. Such strategies of marker-assisted selection (MAS) aim at increasing the mean genetic value of a population (or line) for one or more traits. Obviously, increasing genetic value rests on increasing the frequency of favorable genes controlling that trait. However, deciphering the genetic architecture of quantitative traits is not the primary objective of MAS, nor a prerequisite for its success. In this view, MAS clearly belongs to the field of statistical quantitative genetics, established long before the advent of molecular genetics. Indeed, recent developments on increasing the efficiency of MAS indicate that a better estimate of breeding values is obtained by incorporating all markers in the molecular score (Lange and Whittaker 2001; Meuwissen et al. 2001), which is in some way opposite to the fine mapping of QTL.

Surely, better methods of gene mapping and of estimation of breeding value through markers are still needed and deserve more work. It must be noted, however, that there is another aspect of MAS that also deserves more theoretical development. If we know the locations of a series of genes of interest (hereafter referred to as target genes), the selection process may be reduced to a "building blocks" problem. What is the "best" way to do the gene pyramiding? Could optimal pairwise mating of individuals based on their known genotype at target loci be more efficient than selecting individuals on a molecular score and then mating them randomly? These are the questions we address in the present paper.
Note that this problem is more a matter of simple Mendelian genetics extended to multiple loci (probabilities of recombination between known genes) than one of quantitative genetics and statistics.


Suppose an ideal genotype (ideotype) at a series of target genes can be defined prior to selection (the ideotype has favorable alleles at all loci of interest) but that it is not present in the starting population. The marker-assisted selection process is then reduced to genotype building, where individuals are selected solely based on their genotype at the target loci (or at linked markers controlling the targets), the goal being to obtain the ideal genotype as cheaply and quickly as possible.

The design of optimal breeding schemes aimed at cumulating many genes is a complex problem which few authors have studied so far. When several favorable genes are originally hosted by only two different parents, the simplest strategy involves the production of an F2, F3, Recombinant Inbred Lines (RIL) or Doubled-Haploid (DH) population. Then, the population is screened based on molecular markers for individuals homozygous at the requested loci. In this context, van Berloo and Stam (1998) have considered a set of identified QTL, each controlled by two flanking markers, and studied selection in RIL populations based on flanking markers to produce the best hybrid. If all genes cannot be fixed in a single step of selection, it is necessary to cross again selected individuals with incomplete, but complementary, sets of homozygous loci (Charmet et al. 1999). However, such strategies are limited to small numbers of target loci because the population size necessary to fix the target genes increases exponentially with the number of target loci.

To cumulate more loci in a single genotype by selection on markers, Hospital et al. (2000) proposed a Marker Based Recurrent Selection (MBRS) method using a QTL complementation strategy in a randomly mating population. When evaluating this approach using simulations with 50 detected QTL in a population of 200, they found that the frequency of favourable alleles went up to 100% in ten generations when markers were located exactly on the QTL, but only up to 92% when the marker-QTL distance was 5 cM. The reduced efficiency in the latter case is due to the probability of "losing" the QTL during the breeding scheme because of recombination between the markers and the QTL. This probability increases with the duration of the breeding scheme because of the accumulation of meioses; hence, it is important to cumulate and fix the target genes as rapidly as possible. Hospital et al. (2000) concluded that the optimization of pairwise crosses between selected individuals should be the most efficient way to decrease the duration of the breeding scheme at constant cost.

In this study, we present a general framework to optimize breeding schemes aimed at accumulating identified genes from multiple parents into a single genotype. We will describe an algorithm that allows one to build every possible succession of pair crosses leading to the target genotype. We will show how to compute the probabilities of gene transmission through these crosses and investigate the duration (in terms of number of generations) and the cost (in terms of population sizes) needed to produce the ideal genotype.

Methods

Definitions
We want to cumulate into a single genotype genes that have been identified in multiple parents. For this study, we will assume that we have n loci of interest and a set of n founding parents labeled {Pi, i ∈ [1 . . . n]}, with Pi being homozygous for the favorable allele at the i-th locus, and homozygous for unfavorable alleles at the remaining n − 1 loci. We assume that the recombination fractions between the loci are known. We want to derive the "ideal" genotype (called ideotype), which is homozygous for the favorable allele at all n loci. To obtain the ideotype, one must describe a way of crossing the founding parents and their offspring to pass on all the favorable alleles to this ideotype. We will call a particular set of crosses allowing this transmission a breeding scheme. We will assume that every founding parent is involved in only one cross in the breeding scheme.

Figure 1 around here


As can be seen in figure 1, the breeding scheme has two parts. The first part, called the pedigree, aims at cumulating one copy of every target gene in a single genotype (called the root genotype). The second part, called the fixation steps, aims at fixing the target genes in a homozygous state, that is, at deriving the ideotype from the root genotype.

Pedigree
A pedigree can be represented by a binary tree with n leaves corresponding to the n founding parents (figure 1). A pedigree therefore has n − 1 internal nodes, one per cross.

Figure 2 around here

Each node of the tree (except if it corresponds to a founding parent) is called an intermediate genotype, and has two parents (figure 2). So we distinguish between founding parents appearing at the top of the pedigree (leaves of the tree) and (ordinary) parents involved in crosses in the rest of the breeding scheme. Obviously, each intermediate genotype becomes a parent in the next cross. More importantly, note that an intermediate genotype is not an arbitrary offspring of a given cross; rather, it is a particular genotype selected among these offspring such that all parental target genes are present. The part of a pedigree above a given node (i.e., leading to a given intermediate genotype) is called a sub-pedigree.

Intermediate genotypes
An intermediate genotype is denoted H(s1)(s2), where s1 is the subset of target genes inherited from one parent and s2 the subset inherited from the other. Note that, within a subset, the favorable alleles are in coupling phase (they were carried by the same gamete), while favorable alleles from different subsets are in repulsion phase (carried by different gametes). Each intermediate genotype must produce and pass on to its offspring a gamete s carrying all the favorable alleles in s1 and s2 (so that s = s1 ∪ s2).

Fixation steps
We consider the fixation steps separately because they are not a matter for optimization here. Rather, they are a matter of breeding technique, depending on particular conditions that are the same for all pedigrees. Hence, in our work we optimize the pedigree part of the breeding scheme, while the fixation steps follow a fixed protocol and have the same duration regardless of the root genotype. Nevertheless, let us briefly consider the way these steps can be implemented in practice as well as their impact on the efficiency of the breeding scheme.

One possible procedure for the fixation steps is to generate a population of doubled-haploids from the root genotype. In this case, a population of gametes is obtained from this genotype and their genetic material is doubled. This leads to a population of fully homozygous individuals, amongst which the ideotype can be found. Using this method, the ideotype can be reached in just one additional generation after the root genotype is obtained. However, producing large populations of doubled-haploids is possible in only a few plant species. Thus the fixation steps we implement for our study are as follows.
• First, obtain a genotype carrying all favorable alleles in coupling (namely, H(1...n)(B)) by crossing the root genotype with a blank parent (denoted H(B)(B)) containing none of the favorable alleles. This guarantees that the linkage phase of the offspring is known and that the H(1...n)(B) genotype can be identified without ambiguity.
• Second, self H(1...n)(B) to give the ideotype in a single generation.

With this procedure, the ideotype is reached in two additional generations after the root genotype. This means that the fixation steps correspond to two nodes and therefore that the breeding scheme has a total of n − 1 + 2 = n + 1 nodes. A possible alternative to crossing with a blank parent is a cross with one of the founding parents. In this case the linkage phase is still known and one of the target genes (that provided by the founding parent) is already in the homozygous state, thus improving the fixation.
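The pedigree representation described above (a binary tree whose leaves are founding parents and whose internal nodes are intermediate genotypes H(s1)(s2)) can be sketched as a small data structure. This is only an illustration: the class and function names are hypothetical, and the sketch tracks which target genes each node carries without modelling recombination or selection.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Node:
    """A founding parent (leaf) or an intermediate genotype H(s1)(s2)."""
    genes: frozenset                  # target genes carried: s = s1 | s2
    left: Optional["Node"] = None     # parent contributing s1
    right: Optional["Node"] = None    # parent contributing s2

def founder(i):
    """Founding parent P_i, carrying only target gene i."""
    return Node(frozenset([i]))

def cross(a, b):
    """Intermediate genotype selected from the cross a x b."""
    if a.genes & b.genes:
        # a shared gene would mean that some founding parent is used twice
        raise ValueError("parents must carry disjoint sets of target genes")
    return Node(a.genes | b.genes, a, b)

def count_crosses(node):
    """Number of internal nodes (crosses) in the pedigree above `node`."""
    if node.left is None:
        return 0
    return 1 + count_crosses(node.left) + count_crosses(node.right)

if __name__ == "__main__":
    p1, p2, p3, p4 = (founder(i) for i in (1, 2, 3, 4))
    root = cross(cross(p1, p2), cross(p3, p4))
    print(sorted(root.genes), count_crosses(root))
```

With four founding parents the root genotype carries all four target genes after n − 1 = 3 crosses; adding the two fixation steps gives the n + 1 nodes of the complete breeding scheme.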
The choice of the parent to use may be subject to particular considerations depending on the value of the founding parents, the location of the loci, etc., and was therefore not considered in this study. Another alternative to these methods would be to self the root genotype directly in order to obtain the ideotype. However, selfing the root genotype breaks the linkage between favorable alleles, and in general one cannot identify these breaks because linkage phase is rarely known in selfed populations. Selfing the root genotype and the following offspring would therefore be counterproductive and span too many generations when compared to the methods previously cited.

Pedigree height
The number of generations a pedigree spans is called the height of the pedigree, denoted h. This height varies with the pedigree considered. Recalling that the reference fixation steps considered in this work span two generations, the complete breeding scheme spans h + 2 generations. A pedigree is of maximum height when just one cross is performed at each generation (involving an intermediate genotype H and a founding parent). We will call this type of pedigree a cascading pedigree in the rest of the paper. As only one new gene is cumulated at each generation, the height of a cascading pedigree is n − 1. Conversely, a pedigree is of minimum height when the maximal number of crosses is performed at each generation. It is easy to show that this minimum height is a + 1 where n = 2^a + b, 0 < b ≤ 2^a, and a and b are integers. Finally, the height h of a pedigree cumulating n genes satisfies

\[ \lceil \log_2(n) \rceil \le h \le n - 1 \tag{1} \]

where ⌈x⌉ denotes the smallest integer greater than or equal to x.

Number of pedigrees
The number of pedigrees cumulating n genes is the number of binary trees with n labeled leaves, a problem studied many years ago by Rohlf (1983). Here we show another way to calculate this number. The root genotype of a pedigree cumulating n target genes comes from the cross of two parents carrying respectively p and n − p target genes, where 1 ≤ p ≤ n − 1. Let A(p) be the number of sub-pedigrees cumulating p specified genes. Summing over all possible values of p, we can compute the number A(n) of pedigrees cumulating n genes via:

\[ A(n) = \frac{1}{2} \sum_{1 \le p \le n-1} \binom{n}{p} A(p) \, A(n-p). \tag{2} \]

The factor 1/2 is there to ensure that the crossing of two given parents is counted only once. This recurrence can be solved (see Appendix), and leads to:

$$A(n) = \prod_{p=2}^{n} (2p - 3) = (2n - 3)!! \tag{3}$$

for the total number of pedigrees cumulating the $n$ genes. Table 1 gives some numerical values of $A(n)$; clearly the total number of pedigrees increases very fast with the number of loci considered. This shows that for more than 5 genes, a hand enumeration of all pedigrees is hopeless and so a computerized approach is mandatory. We will now describe an algorithm to build up all these pedigrees. Because of the fast increase of $A(n)$ with $n$, the number of loci that can be treated will necessarily be quite limited, even when running such an algorithm on a very powerful computer.

A simple algorithm to build all possible pedigrees

To obtain a pedigree of height $h$, we can merge two sub-pedigrees, one of height $h - 1$ and one of height $h' \le h - 1$. Note that, as we demand that founding parents are involved in only one cross in a pedigree, we only merge sub-pedigrees whose root genotypes have no target genes in common. From this, we infer an iterative process to build all possible pedigrees for cumulating $n$ genes. We consider the founding parents as pedigrees of null height ($h = 0$). Assuming we have constructed all pedigrees of height less than or equal to $h$, we generate all pedigrees of height $h + 1$ as follows:

1. Examine all distinct pairs of sub-pedigrees $\{P_1, P_2\}$ of respective heights $h_1$ and $h_2$, with $h_1 = h$ and $h_2 \le h$.
2. If the root genotypes of $P_1$ and $P_2$ do not have any target genes in common, merge $P_1$ and $P_2$ to form a sub-pedigree $P$.
3. If $P$ cumulates all $n$ genes, store it; otherwise add it to the list of sub-pedigrees of height $h + 1$.

This construction can be iterated until $h + 1$ reaches its maximum, namely $h + 1 = n - 1$ (see equation 1). In figure 3 we sketch the progress of this algorithm in the case of 4 genes.

Figure 3 around here

Gene transmission probabilities through a pedigree

Let us focus on a particular pedigree node, corresponding to an intermediate genotype $H_{(s_1)(s_2)}$ (not a founding parent). Based on the recombination fractions between loci, we can compute the probability that $H_{(s_1)(s_2)}$ passes on to its offspring the set of genes $s$ which is the union of $s_1$ and $s_2$. If we denote by $\nu(s)$ the total number of genes in the set $s$, we have $\nu(s) = \nu(s_1) + \nu(s_2)$. Let $\{a_i\}$ be the genes in set $s$ ranked according to their position on the genetic map, so that $s = (a_1, a_2, \ldots, a_{\nu(s_1)+\nu(s_2)})$. Let $r_{x,y}$ be the recombination fraction between $x$ and $y$. The probability that a gamete of $H_{(s_1)(s_2)}$ contains the set $s$ of genes is:

$$P\left(H_{(s_1)(s_2)} \to s\right) = \frac{1}{2} \prod_{i=1}^{\nu(s)-1} \pi(i, i+1) \tag{4}$$

where $\pi(i, i+1) = r_{a_i,a_{i+1}}$ if genes $a_i$ and $a_{i+1}$ are in different subsets and $\pi(i, i+1) = 1 - r_{a_i,a_{i+1}}$ otherwise. Note that there might be other target genes on the map, located between the $a_i$'s, but not belonging to the set $s$; recombinations between those genes do not matter here. As an example illustrating formula 4, consider the genotype $H_{(13)(256)}$. The probability that it passes on the set $(12356)$ is

$$P\left(H_{(13)(256)} \to (12356)\right) = \frac{1}{2}\, r_{1,2}\, r_{2,3}\, r_{3,5}\, (1 - r_{5,6}) \tag{5}$$
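Equations 4 and 5 translate directly into a few lines of code. The following is a minimal sketch, not the authors' program: the function names are hypothetical, locus labels are assumed to sort in map order, and the helper `r_equal_spacing` assumes equally spaced loci (20 cM apart by default) converted to recombination fractions with the standard Haldane map function.

```python
import math


def haldane_r(d_morgans):
    """Recombination fraction for a map distance in Morgans (Haldane map function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))


def r_equal_spacing(x, y, spacing=0.2):
    """Assumption for illustration: locus k sits at position k * spacing (in Morgans)."""
    return haldane_r(spacing * abs(y - x))


def transmission_probability(s1, s2, r):
    """Probability (equation 4) that H_(s1)(s2) passes on the union of s1 and s2.

    s1, s2: disjoint collections of locus labels; sorting the labels is
            assumed to give their order on the genetic map.
    r:      function r(x, y) giving the recombination fraction between x and y.
    """
    origin = {locus: 1 for locus in s1}
    origin.update({locus: 2 for locus in s2})
    loci = sorted(origin)
    prob = 0.5
    for a, b in zip(loci, loci[1:]):
        rab = r(a, b)
        # Neighbouring genes from different subsets need a recombination (r);
        # genes from the same subset must stay together (1 - r).
        prob *= rab if origin[a] != origin[b] else 1.0 - rab
    return prob


# Worked example of equation 5: genotype H_(13)(256).
p = transmission_probability((1, 3), (2, 5, 6), r_equal_spacing)
print(p)
```

Passing the recombination fractions as a function rather than a table keeps the sketch independent of how the genetic map is stored; any map function could be substituted for the Haldane one.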

Knowing these probabilities, the overall probability to obtain the root genotype of a given pedigree is the product, over all the pedigree's other nodes, of the probabilities calculated as in equation 4.

Minimum population sizes necessary to obtain the ideotype

Let us call $p_f$ and $p_m$ the probabilities, computed as in (4), that each parent of a given node passes on its particular subset of genes. From these probabilities we can compute the population size $N$ needed to get the intermediate genotype at this node with a probability of success $\gamma$. The probability that none of the $N$ offspring has the right genotype is $(1 - p_f p_m)^N$; identifying this with $1 - \gamma$ gives

$$N = \frac{\ln(1 - \gamma)}{\ln(1 - p_f\, p_m)} \tag{6}$$

where $\ln$ denotes the natural logarithm. From (6), we can compute the population sizes required at each node. Now the overall probability of success of the pedigree is the product of the probabilities of success at each of its nodes. Similarly, we can compute the population sizes required for the fixation steps. If all nodes are associated with the same probability of success $\gamma$ as considered here, then the overall probability of success of the breeding scheme is $\gamma^{n+1}$. The sum of all population sizes needed in the breeding scheme (pedigree and fixation steps) is denoted by $N_{tot}$. The largest of the population sizes to be handled at any node or step during the whole breeding scheme is denoted by $N_{max}$.
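Equation 6 and the overall success probability can be evaluated as follows. This is a sketch under stated assumptions: the function names are hypothetical, and rounding $N$ up to the next integer is a choice made here, since the text does not say how non-integer values are handled.

```python
import math


def population_size(p_f, p_m, gamma):
    """Smallest integer N with 1 - (1 - p_f * p_m)**N >= gamma (equation 6, rounded up)."""
    p = p_f * p_m
    if not 0.0 < p <= 1.0 or not 0.0 < gamma < 1.0:
        raise ValueError("need 0 < p_f * p_m <= 1 and 0 < gamma < 1")
    if p == 1.0:
        return 1  # the target genotype is obtained with certainty
    return math.ceil(math.log(1.0 - gamma) / math.log(1.0 - p))


def overall_success(gamma, n_genes):
    """Overall success probability gamma**(n + 1) when every node and step uses the same gamma."""
    return gamma ** (n_genes + 1)


# Example: one parent transmits its gamete with probability 0.0824,
# the other is a homozygous founding parent (probability 1).
print(population_size(0.0824, 1.0, 0.999))
print(overall_success(0.999, 4))
```

With $\gamma = 0.999$ and $n = 4$, `overall_success` gives a value that rounds to the 0.995 quoted for the four-gene case study.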

Results

We have developed a computer program implementing the algorithm described in the Methods section that builds all pedigrees leading to the ideotype for a given number $n$ of genes. Then, given the $r_{i,j}$ values, the program determines the gene transmission probabilities and the cumulated population size $N_{tot}$ for each pedigree. We now apply this algorithm to a set of particular cases to illustrate the results obtained with our method.
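The authors' program is not reproduced in the text. As an illustration only, the iterative construction from the Methods section (merge a sub-pedigree of height $h$ with a disjoint one of height at most $h$) can be sketched as below. The representation is an assumption: a founding parent is an integer gene label, and a cross is the unordered pair (a `frozenset`) of its two parental trees.

```python
def enumerate_pedigrees(n):
    """Return the set of all pedigrees cumulating genes 1..n, built height by height."""
    all_genes = frozenset(range(1, n + 1))
    # Height 0: founding parents, stored as (set of genes carried, tree).
    by_height = {0: {(frozenset([g]), g) for g in range(1, n + 1)}}
    complete = set()
    for h in range(0, n - 1):
        new = set()
        lower = [p for k in range(h + 1) for p in by_height[k]]
        for genes1, tree1 in by_height[h]:          # h1 = h
            for genes2, tree2 in lower:             # h2 <= h
                if genes1 & genes2:
                    continue                        # shared target genes: no merge
                merged = (genes1 | genes2, frozenset([tree1, tree2]))
                if merged[0] == all_genes:
                    complete.add(merged[1])         # full pedigree: store it
                else:
                    new.add(merged)                 # sub-pedigree of height h + 1
        by_height[h + 1] = new
    return complete


def height(tree):
    """Height of a pedigree: 0 for a founding parent, else 1 + the tallest parent."""
    return 0 if isinstance(tree, int) else 1 + max(height(child) for child in tree)


pedigrees = enumerate_pedigrees(4)
print(len(pedigrees), sorted({height(t) for t in pedigrees}))
```

Storing each cross as an unordered pair means that the same pair of sub-pedigrees visited in both orders is counted once, which plays the role of the factor 1/2 in equation 2.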

Cumulating 4 genes: a case study

Using our program, we have generated the $(2n - 3)!! = 15$ possible pedigrees for cumulating 4 genes located on a single chromosome. We assume that the recombination fractions between adjacent loci are the same and correspond to 20 centiMorgans using the Haldane (1919) mapping distance. As the recombination fraction is the same for all pairs of adjacent loci, some pedigrees have the same transmission probability and cumulated population size. In that case, we show only one example of pedigree per cumulated population size. Figure 4 shows the three pedigrees that require the smallest $N_{tot}$, and for each gives the allelic transmission probabilities. The population sizes have been computed using a probability of success $\gamma = 0.999$ at each node, leading to an overall probability of success equal to 0.995 for the pedigree. The population size needed at each node is indicated in a box. The cumulated population size $N_{tot}$ is also given.

Figure 4 around here

Figure 4.a shows a cascading pedigree. This pedigree spans 5 generations ($h = n - 1 = 3$ for the pedigree height, plus two generations for the fixation steps) and requires the smallest cumulated population size of all the pedigrees. The two other best pedigrees last 4 generations ($h = \log_2 n = 2$ for the pedigree height, plus two generations for the fixation steps). The pedigree that requires the next smallest $N_{tot}$ is the one represented in figure 4.b. It cumulates loci 1 and 4 on one sub-pedigree and 2 and 3 on the other, before generating the $H_{(1234)(B)}$ genotype. The population sizes needed for this pedigree are large at all nodes when compared to the cascading pedigree. The pedigree represented in figure 4.c requires an even larger $N_{tot}$ because a huge population size is needed to produce the root genotype $H_{(12)(34)}$; conversely, the population size needed to produce the $H_{(1234)(B)}$ genotype is much lower.
We see here that cascading pedigrees are less expensive in terms of population sizes when compared to other pedigrees. This can be understood from the fact that the node at the second generation of non-cascading pedigrees involves a genotype composed of two gametes that are both obtained by rare recombination events. As the recombination probabilities are quite low, the probability of obtaining the target genotype is very low. Hence, population sizes needed at this step are typically enormous. On the contrary, for cascading pedigrees, only one of the parental gametes requires a recombination event; hence the minimal population sizes needed at each step of a cascading pedigree are much lower than for other pedigrees. In our case with 4 loci, the cascading pedigree spans one generation more than other pedigrees but requires a much smaller $N_{tot}$; hence, cascading pedigrees are a good choice. However, when more loci are to be cumulated, the difference in heights (i.e., in duration) between cascading pedigrees and other types of pedigrees becomes more important, as is exemplified below. Also, we will see soon that the efficiency of cascading pedigrees relative to other types of pedigrees depends on the method used to cross individuals and obtain the intermediate genotype at each node.

Cumulating many genes

We now examine a case with 8 loci to get a feeling for the qualitative behavior in the case of a larger number of target genes. We work again with a constant recombination fraction between adjacent loci corresponding to a Haldane mapping distance of 20 centiMorgans. Of interest are the cumulated population size ($N_{tot}$), the greatest population size amongst all nodes ($N_{max}$) and the total number of generations needed to derive the ideotype. We shall examine these numbers for three breeding strategies.

Reference method for comparison (MBRS)

We take as a reference method the marker-based recurrent selection (MBRS) strategy proposed by Hospital et al. (2000).
In a population the molecular score is computed as the number of target genes carried by each individual. To avoid fixation of unfavorable alleles because of linkage disequilibrium and drift, individuals are selected based on a "QTL Complementation" strategy, which is shown to be more efficient than simple "mass selection" on the molecular score. In their study, Hospital et al. (2000) started from a population in linkage equilibrium; here, we use a starting population composed of founding parents, which is thus in complete linkage disequilibrium. To use MBRS as a reference method, we considered that its breeding scheme was complete when the ideotype was obtained in 99% of the simulations performed. We also assumed a constant population size throughout generations; if $N_g$ is the population size at each generation, then the cumulated population size ($N_{tot}$) is $N_g$ times the number of generations. Naturally, increasing $N_g$ leads to completing the breeding scheme in fewer generations. We found that with MBRS the breeding scheme did not complete in fewer than 7 generations when using realistic population sizes. Also, we did not consider more than 12 generations of MBRS because $N_g$ was already small enough (70) for 12 generations.

Pairwise Crossing 1 method (PWC1)

Our second breeding strategy is simply to construct the ideotype by pedigree optimization as described in the Methods section. We refer to this strategy as PWC1, for Pairwise Crossing of the first type. Taking the pedigree with the lowest $N_{tot}$ for each height, we show in Table 2 our results for pedigrees spanning from five to nine generations. The breeding scheme spanning five generations corresponds to a pedigree which is a perfectly balanced pyramid of height $\log_2 8 = 3$, where the maximum number of crosses is performed at each generation. It starts with the 8 founding parents; at the first generation four crosses are performed, leading to four intermediate genotypes. At the second generation two crosses are performed, and at the third generation a single one. After these three generations come the fixation steps, which span two generations.
The scheme spanning a total of nine generations comes from a cascading pedigree; it is the one that requires the smallest cumulated population size $N_{tot}$ and has the smallest $N_{max}$. For the breeding schemes spanning fewer than nine generations, $N_{tot}$ and $N_{max}$ are higher. This can be explained in the same way as for the four loci case: when following a non-cascading pedigree, at least one intermediate genotype must be obtained that carries two gametes, both of which are produced by rare recombination events. The probability of obtaining such an intermediate genotype is typically very low, so the associated population size is quite large. On the contrary, cascading pedigrees never have high $N_{max}$ values. One sees from Table 2 that the optimal crossing with PWC1 always requires a smaller cumulated population size ($N_{tot}$) than MBRS for a given number of generations. However, cumulated population sizes with PWC1 are still not small, and do not decrease very rapidly with increased duration. Moreover, PWC1 requires a larger $N_{max}$ than MBRS, except for 7 generations. Clearly, for schemes spanning 7 generations, PWC1 is a better choice than MBRS from any point of view and is thus preferred. Yet, it is harder to draw a general conclusion from the results in Table 2 for other durations. In fact, the choice of a breeding strategy must incorporate economic and practical considerations that are beyond the scope of the present paper. In particular, one has to consider: (i) the cost of genotyping (depending mostly on $N_{tot}$, though not only); (ii) the cost of pairwise crossings, which might be more demanding than random mating depending on the species; and (iii) whether the limiting step at $N_{max}$ is feasible given the genotyping facilities. As is often the case in breeding theory, a trade-off between duration and cost is observed here (lower cost for longer duration). However, using durations greater than 9 generations would take us out of this paper's framework. More explicitly, considering pedigrees lasting more generations than the maximum given in equation 1 requires allowing for other kinds of pedigrees, involving for instance founding parents multiple times or the use of extra crosses when a given one fails. Such extensions of our hypotheses were not considered.
Nevertheless, because of the large values of $N_{tot}$ found in PWC1, we now investigate whether a modified crossing method can lower the population size needed (i.e., the cost) further, so that an optimal pedigree breeding scheme clearly outcompetes MBRS.

Pairwise Crossing 2 method (PWC2)

Clearly the main bottleneck of non-cascading pedigrees is their requirement for rare recombinations at some nodes, arising generally at advanced generations. To alleviate this problem, we now adopt a modified crossing procedure at each node, which we call PWC2, for Pairwise Crossing of the second type. In this new strategy, we extend PWC1 and introduce a two-step hybridization procedure to derive intermediate genotypes. This is illustrated in figure 5. Suppose an intermediate genotype $H'$ is to be obtained from the cross of two (non-founding) parents $H_1$ and $H_2$. Rather than cross $H_1$ and $H_2$ directly, we first cross each separately to a blank parent. From the resulting offspring, we select those individuals carrying all of their parent's favorable alleles (necessarily in coupling). Then two such individuals are crossed to give $H'$. The key point with this two-step hybridization procedure is that the two gametes coming from a recombination can be selected independently. The efficiency of this strategy comes from the fact that the sum of the population sizes needed to obtain two gametes independently in separate crosses is generally much lower than the population size needed to obtain them jointly in a single cross. (Note that equation 6 involves the product of $p_f$ and $p_m$.) Hence, the cost of obtaining genotypes from the crosses with the blank parents can be much lower than with the hybridization performed in the PWC1 strategy. Conversely, this two-step hybridization procedure has the drawback of adding an extra generation at each of the corresponding PWC1 nodes where it is used. (N.B.: if a founding parent is involved in a cross, we do not perform the two-step hybridization as it is never useful.) However, this drawback does not increase the total pedigree duration by the number of nodes, at least for pedigrees involving more than one node per generation (i.e., non-cascading pedigrees). In fact, the additional number of generations is at most $h - 1$, because two-step hybridization is not useful with founding parents. Hence, the total duration is less than doubled compared to the PWC1 method.
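The claim that two separate selections are cheaper than one joint selection can be checked numerically with equation 6. The sketch below uses illustrative probabilities that are not taken from the paper, and it compares only the two quantities named in the text: it does not include the cost of the subsequent cross between the selected offspring.

```python
import math


def n_required(p, gamma):
    """Population size from equation 6 for success probability p per offspring, rounded up."""
    return math.ceil(math.log(1.0 - gamma) / math.log(1.0 - p))


# Illustrative values (assumptions): each recombinant gamete has probability 0.05.
p_f, p_m, gamma = 0.05, 0.05, 0.999

joint = n_required(p_f * p_m, gamma)                          # both gametes in one cross
separate = n_required(p_f, gamma) + n_required(p_m, gamma)    # one cross per gamete, each to a blank parent

print("joint:", joint, "separate:", separate)
```

In the joint cross the success probability per offspring is the product $p_f p_m$, so $N$ scales roughly as $1/(p_f p_m)$, whereas the separate crosses scale as $1/p_f + 1/p_m$.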
The net effect is to favour pedigrees involving many nodes per generation (e.g., perfectly balanced pedigrees) compared to pedigrees involving few nodes per generation (e.g., cascading pedigrees). Because of this, the value of non-cascading pedigrees compared to cascading ones is renewed, as will be seen below. When cumulating eight loci using PWC2, we obtained results for breeding schemes spanning from six to nine generations (table 2). Compared to PWC1, the durations of the optimal breeding schemes are increased by at most two generations, but with the PWC2 strategy the $N_{tot}$ needed for pedigrees is reduced, as can be seen in table 2. With PWC1, the breeding schemes lasting 9 generations corresponded only to cascading pedigrees. With PWC2, breeding schemes lasting 9 generations include both cascading and non-cascading pedigrees, because cascading pedigrees are not affected by PWC2 since a founding parent is involved at each cross. The best breeding scheme lasting 9 generations with PWC2 in table 2 does not correspond to a cascading pedigree and has a lower $N_{tot}$ than PWC1 at the same duration. Because PWC2 favors pedigrees having many nodes per generation, it is interesting to note that in table 2 the pedigree which requires the smallest cumulated population size is also the one that spans the fewest generations. Thus PWC2 fulfils both objectives of gene pyramiding: minimization of pedigree duration and of pedigree cost. Unless an even faster strategy is mandatory, e.g. for economic reasons, this pedigree using the two-step hybridization procedure is optimal. Recall that the population sizes depend on the choice of a requested probability of success at each node ($\gamma$); population sizes given in table 2 were computed with a quite conservative probability of success for the three breeding strategies.

Discussion

The present study described a general framework for the pyramiding of multiple genes into a single genotype. By combining these results with those available for various other aspects of marker-assisted selection (Dekkers and Hospital 2002), it is now possible to optimize complex breeding schemes incorporating molecular information. Such optimization is relevant not only to plant or animal breeding. The possibility to develop specific genotypes rapidly at low cost is also of general interest for fundamental studies on the genetic architecture of complex traits: validation of candidate genes or QTL effects, studies of gene by genetic background interactions, gene by gene epistatic interactions, etc. Interestingly enough, our study also links to other topics of population genetics not related to selection. For example, putting the pedigrees described here upside down, one can turn them into coalescence trees. The main modification this brings is that for coalescence trees, a fixed probability is associated to each node, but branch lengths may vary. The link between these two areas is clearly visible here in the computation of the number of pedigrees.

In our study, we made some simplifying assumptions on the genotype of the founding parents for the sake of demonstration. In particular, we supposed that founding parents were homozygous for the favorable allele at each target locus. However, in our framework, it is also possible to study pedigrees starting from an arbitrary population of different founding parents. As we have seen, pedigrees are defined as binary trees, so that the only inputs needed for our algorithm are the genotypes at the leaves. Hence, founding parents other than the simple ones we chose can be input at the top of the trees. The only limitation is that the linkage phase of favorable alleles in founding parents must be given. If this linkage phase is not known, it is still possible to compute the gene transmission probabilities conditionally on all possible linkage phases of target genes in the founding parents. These probabilities can then be used for the computation of optimization criteria. As an example, one may use a conservative strategy to minimize cumulated population sizes.
First, compute all gene transmission probabilities for the different possible linkage phases. Then consider the linkage phase associated with the smallest probability and compute cumulated population sizes accordingly. Alternatively, one may average the cumulated population sizes over all possible linkage phases.

In some situations, we have used crosses between intermediate genotypes and what we called a blank parent. A possible alternative to such crosses is to perform a generation of haplo-diploidization. Unfortunately, this technique is not available for all organisms. An interesting case is when the blank parent is a recurrent parent with an elite genetic background into which one wants to introgress all favorable genes. In this case, the last fixation step can be performed after the marker-assisted introgression of the favorable genes into a homogeneous genetic background. It is then possible to combine the present results with those on the optimization of marker-assisted introgression strategies, which has been studied extensively by various authors (e.g. Melchinger (1990), Hospital et al. (1992), Visscher et al. (1996), Hospital and Charcosset (1997), Hospital (2001), Servin and Hospital (2002)). Another interesting case is to replace the blank parent by one of the founding parents or by any intermediate genotype. Also, one should consider the case where a founding parent, or any intermediate genotype, can participate in more than one cross in the pedigree. Such extensions of the framework, which surely need more theoretical development, should be a valuable step towards a complete theory of selection, including convergence with the general case of random mating.

Finally, the main limitation of the method proposed here is that the number of possible pedigrees becomes very large even with relatively few loci, so the computer program implementing the exhaustive enumeration cannot handle more than a dozen loci.
For larger numbers of loci, one possibility is to apply our method to each chromosome separately (a dozen targets per chromosome being a bearable bound in real situations), and assume that subsets of loci located on different chromosomes can be cumulated in parallel and then combined in a few generations to obtain the ideotype across chromosomes. This would probably give a reasonably good idea of what the optimal pedigree across chromosomes might be. However, this might not give the exact solution, an unsatisfactory situation from a theoretical viewpoint. To deal with more loci, some intermediate optimization must be used which selects the best sub-pedigree producing a given intermediate genotype. This kind of "pruning" approach can be converted into a dynamic programming algorithm which no longer needs to consider all pedigrees. We are currently exploring this strategy.

References

Charmet, G., N. Robert, M. Perretant, G. Gay, P. Sourdille, et al., 1999 Marker-assisted recurrent selection for cumulating additive and interactive QTLs in recombinant inbred lines. Theoretical and Applied Genetics 99: 1143–1148.

Dekkers, J. and F. Hospital, 2002 The use of molecular genetics in the improvement of agricultural populations. Nature Reviews Genetics 3(1): 22–32.

Hospital, F., 2001 Size of donor chromosome segments around introgressed loci and reduction of linkage drag in marker-assisted backcross programs. Genetics 158(3): 1363–1379.

Hospital, F. and A. Charcosset, 1997 Marker-assisted introgression of quantitative trait loci. Genetics 147(3): 1469–1485.

Hospital, F., C. Chevalet, and P. Mulsant, 1992 Using markers in gene introgression breeding programs. Genetics 132(4): 1199–1210.

Hospital, F., I. Goldringer, and S. Openshaw, 2000 Efficient marker-based recurrent selection for multiple quantitative trait loci. Genetical Research 75: 357–368.

Hospital, F., L. Moreau, F. Lacoudre, A. Charcosset, and A. Gallais, 1997 More on the efficiency of marker assisted selection. Theoretical and Applied Genetics 95: 1181–1189.

Lande, R. and R. Thompson, 1990 Efficiency of marker-assisted selection in the improvement of quantitative traits. Genetics 124(3): 743–756.

Lange, C. and J. Whittaker, 2001 On prediction of genetic values in marker-assisted selection. Genetics 159(3): 1375–1381.

Melchinger, A., 1990 Use of molecular markers in breeding for oligogenic disease resistance. Plant Breeding 104: 1–19.

Meuwissen, T., B. Hayes, and M. Goddard, 2001 Prediction of total genetic value using genome-wide dense marker maps. Genetics 157(4): 1819–1829.

Moreau, L., A. Charcosset, F. Hospital, and A. Gallais, 1998 Marker-assisted selection efficiency in populations of finite size. Genetics 148(3): 1353–1365.

Rohlf, F., 1983 Numbering binary trees with labeled terminal vertices. Bull Math Biol 45(1): 33–40.

Servin, B. and F. Hospital, 2002 Optimal positioning of markers to control genetic background in marker-assisted backcrossing. Journal of Heredity 93(3): 214–217.

van Berloo, R. and P. Stam, 1998 Marker-assisted selection in autogamous RIL populations: a simulation study. Theoretical and Applied Genetics 96: 147–154.

Visscher, P., C. Haley, and R. Thompson, 1996 Marker-assisted introgression in backcross breeding programs. Genetics 144(4): 1923–1932.

Appendix A

Here we show how to compute the total number of pedigrees. We start from the recursion relation (2):

$$A(n) = \frac{1}{2} \sum_{1 \le p \le n-1} \binom{n}{p} A(p)\, A(n-p). \tag{A.1}$$

The initial condition is $A(1) = 1$. We introduce the generating function

$$g(u) = \sum_{p=1}^{\infty} \frac{A(p)}{p!}\, u^p. \tag{A.2}$$

Using (A.1), one finds that the function $g(u)$ satisfies the equation

$$g(u) = \frac{1}{2}\, g(u)^2 + u, \tag{A.3}$$

which gives

$$g(u) = 1 - \sqrt{1 - 2u}. \tag{A.4}$$

Now recall the series expansion:

$$\sqrt{1 - x} = -\frac{1}{2} \sum_{p=0}^{\infty} \frac{\Gamma(p - 1/2)}{\Gamma(p+1)\,\Gamma(1/2)}\, x^p \tag{A.5}$$

where $\Gamma$ is Euler's Gamma function (which satisfies $\Gamma(x+1) = x\,\Gamma(x)$ and $\Gamma(1/2) = \sqrt{\pi}$). If we plug (A.5) into (A.4), the identification of the resulting coefficients with those in (A.2) leads to

$$A(p) = \frac{2^{p-1}}{\sqrt{\pi}}\, \Gamma\!\left(p - \frac{1}{2}\right) = (2p - 3)!! \tag{A.6}$$
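The step from (A.1) to (A.3) is stated without proof. One way to fill it in, offered as a sketch rather than as the authors' own derivation, is to multiply (A.1) by $u^n/n!$, sum over $n \ge 2$, and use $\binom{n}{p}/n! = 1/\bigl(p!\,(n-p)!\bigr)$:

```latex
\begin{align*}
\sum_{n \ge 2} \frac{A(n)}{n!}\,u^n
  &= \frac{1}{2} \sum_{n \ge 2} \sum_{p=1}^{n-1}
     \frac{A(p)\,u^p}{p!} \cdot \frac{A(n-p)\,u^{n-p}}{(n-p)!}
   = \frac{1}{2}\, g(u)^2, \\
g(u) - A(1)\,u &= \frac{1}{2}\, g(u)^2
  \quad\Longrightarrow\quad g(u)^2 - 2\,g(u) + 2u = 0, \\
g(u) &= 1 \pm \sqrt{1 - 2u}.
\end{align*}
```

Since $g(0) = 0$, the minus sign must be chosen, which is (A.4). As a quick check of (A.6), $p = 2$ gives $\frac{2}{\sqrt{\pi}}\,\Gamma(3/2) = 1$ and $p = 3$ gives $\frac{4}{\sqrt{\pi}}\,\Gamma(5/2) = 3$, in agreement with $(2p - 3)!!$.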

n       3    4    5     6     7       8        10           20
A(n)    3    15   105   945   10395   135135   3.4 × 10^7   8.2 × 10^21

Table 1: The number A(n) of distinct pedigrees for the cumulation of n genes.
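The entries of Table 1 can be reproduced from recurrence (2) and compared with the closed form (3). The function names below are hypothetical and the code is only a numerical check, not part of the original work; `math.comb` requires Python 3.8 or later.

```python
from math import comb


def count_by_recurrence(n_max):
    """Return a list A with A[n] from recurrence (2), for n = 1..n_max (A[0] is unused)."""
    A = [0, 1]  # initial condition A(1) = 1
    for n in range(2, n_max + 1):
        total = sum(comb(n, p) * A[p] * A[n - p] for p in range(1, n))
        A.append(total // 2)  # the sum is always even, so this division is exact
    return A


def count_closed_form(n):
    """(2n - 3)!! written as the product in equation (3); equals 1 for n = 1."""
    result = 1
    for p in range(2, n + 1):
        result *= 2 * p - 3
    return result


A = count_by_recurrence(20)
for n in (3, 4, 5, 6, 7, 8, 10, 20):
    print(n, A[n], count_closed_form(n))
```

Integer arithmetic is used throughout, so the comparison between the two formulas is exact even for $n = 20$.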

        PWC1            PWC2            MBRS
G       Ntot    Nmax    Ntot    Nmax    Ntot    Ng
5       4415    1248    -       -       -       -
6       2741    1248    -       -       -       -
7       2421    870     1147    341     7560    1080
8       2183    606     1166    341     3440    430
9       1394    341     1273    341     1710    190
10      -       -       -       -       1100    110
11      -       -       -       -       880     80
12      -       -       -       -       840     70

Table 2: Minimal population sizes needed to cumulate 8 genes following different cumulating strategies: by optimization of pedigrees (PWC1 and PWC2) or by marker-based recurrent selection (MBRS). The results obtained with each strategy are presented for different durations of the breeding scheme (G, in generations). The cumulated population size (Ntot) and the greatest population size (Nmax) of each breeding scheme are given; for MBRS, Ng is the constant population size at each generation. The 8 loci are placed on a single chromosome, and the distance between adjacent loci is 20 centiMorgans. The different strategies are described in the text.

Figure Legends

Figure 1: Example of a breeding scheme cumulating six target genes. Graphical representation of the objects defined in the Methods section; see text for details.

Figure 2: Details of a node of a pedigree. The gametes (subsets of genes) passed on from the parents $H_{(1)(2)}$ and $H_{(3)(4)}$ to the intermediate genotype $H_{(12)(34)}$ are denoted $s$, as is the gamete the intermediate genotype should pass on as a parent of the next node.

Figure 3: Example of progress of the algorithm when building all pedigrees cumulating 4 loci. For pedigree heights greater than two, only a few of the pedigrees are given. Gray circles represent intermediate genotypes cumulating fewer than 4 favorable genes. Intermediate genotypes that cumulate all 4 genes are labeled $H_{(s_1)(s_2)}$ (see meaning in text). Dashed boxes indicate the two sub-pedigrees merged at a given step.

Figure 4: Representation of three different pedigrees cumulating 4 loci. Pedigree (a) is a cascading pedigree. Pedigrees (b) and (c) differ by the order of crosses of the founding parents. The target genes are represented by black circles. Other genes are represented by gray boxes. At each node we give the transmission probabilities of the targeted genes from parent to offspring. When the probability equals one, it is not indicated. The population sizes needed at each node ($N$) and the cumulated population size ($N_{tot}$) are given.

Figure 5: The two-step hybridization procedure for obtaining an intermediate genotype carrying favorable alleles at four loci (1,2,3,4) from two parents carrying favorable alleles at, respectively, loci (1,2) and (3,4). The first step is performed by crossing each parent with a blank genotype (not represented here). The resulting offspring, carrying the corresponding target genes in coupling phase on one of their chromosomes, are then mated to obtain the desired genotype, $H_{(12)(34)}$.

Figure 1: [Diagram not reproduced. Breeding scheme from the founding parents P1 to P6 (generation 0), through the intermediate genotypes H(1)(2), H(3)(4), H(5)(6) and H(12)(34) (generations 1 and 2) and the root genotype H(1234)(56) (generation 3), followed by the fixation steps leading to the ideotype H(123456)(123456). The pedigree, a node and the root genotype are labeled.]

Figure 2: [Diagram not reproduced. Parents H(1)(2), with s1 = (1), s2 = (2), s = (1,2), and H(3)(4), with s1 = (3), s2 = (4), s = (3,4), giving the intermediate genotype H(12)(34), with s1 = (1,2), s2 = (3,4), s = (1,2,3,4).]

Figure 3: [Diagram not reproduced. Sub-pedigrees built from the founding parents P1 to P4 at heights h = 0, 1, 2 and 3, with examples of complete pedigrees H(12)(34), H(123)(4) and H(124)(3).]

Figure 4: [Diagram not reproduced. Panels a, b and c, with transmission probabilities and population sizes N at each node. Cumulated population sizes shown: Ntot = 331 for panel a; Ntot = 802 and Ntot = 834 for the two other panels.]

Figure 5: [Diagram not reproduced. Parents H(1)(2) and H(3)(4), their offspring carrying s = (1,2) and s = (3,4) in coupling, and the resulting genotype H(12)(34) with s1 = (1,2), s2 = (3,4), s = (1,2,3,4).]

A note on the optimization of cascading pedigrees to cumulate multiple linked target genes

Servin B., Martin O. C., Mézard M. and Hospital F.

A cascading pedigree involves (n − 1) steps (crosses), such that at step k, an intermediate genotype issued from step (k − 1) (or a founding parent for k = 1) and carrying k target genes is crossed to a founding parent bringing one additional target gene (i.e. a gene not amongst the k). An offspring of this cross, carrying the k + 1 genes, is again crossed to a new founding parent (step k + 1), and so on until k = n − 1. From this definition, cascading pedigrees differ only in the order in which the founding parents are crossed. Hence, to define a cascading pedigree, one only needs to define a list of n elements containing each founding parent once. The order of the two parents in the first step does not matter: crossing P1 with P2 is equivalent to crossing P2 with P1. Hence the total number of cascading pedigrees, C(n), is the number of possible ordered lists of n parents, halved because of the redundancy in the first cross, i.e.

$$C(n) = \frac{1}{2}\, n! \tag{1}$$
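As an illustration of equation (1), the following minimal sketch enumerates cascading pedigrees as ordered lists of parents in which the first two parents are interchangeable. The function name `cascading_pedigrees` and the parent labels are hypothetical and serve only this example.

```python
from itertools import permutations
from math import factorial


def cascading_pedigrees(parents):
    """Yield one crossing order per distinct cascading pedigree.

    Two orders describe the same pedigree when they differ only by
    swapping the two parents of the first cross.
    """
    seen = set()
    for order in permutations(parents):
        # The first cross is unordered, the later crosses are ordered.
        key = (frozenset(order[:2]),) + order[2:]
        if key not in seen:
            seen.add(key)
            yield order


parents = ["P1", "P2", "P3", "P4"]
pedigrees = list(cascading_pedigrees(parents))
# Number of enumerated pedigrees next to the closed form n!/2.
print(len(pedigrees), factorial(len(parents)) // 2)
```

The enumeration grows factorially, so it is only practical for a handful of parents.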

We focus on step k + 1 of a cascading pedigree (figure). At step k + 1, the pedigree produces an intermediate genotype $G_{k+1}$ issued from the cross of an intermediate genotype $G_k$, carrying k target genes, and a founding parent $P_j$ carrying the additional target j.

Because all founding parents are assumed homozygous, the probability that $P_j$ passes on target j to $G_{k+1}$ is one. The probability that $G_k$ passes on its k targets to $G_{k+1}$ is the probability that target i (inherited from parent $P_i$ at the previous step) is incorporated in the same gamete as the other k − 1 targets by recombination. Let $s_k$ be the set of loci of those k target genes. We call adjacent loci the pairs of loci that are located next to each other on the genetic map. In other words, $l_1$ and $l_2$ are two adjacent target loci if there is no other target lying between $l_1$ and $l_2$ on the map.

We want to compute the probability $f(\mathcal{P})$ that the ideotype $I^*$ (i.e. the genotype combining all n target genes) is obtained at the end of a pedigree $\mathcal{P}$, and use this probability to compare the efficiency of different pedigrees (i.e. different orders of crossing the founding parents). This probability is based on the probabilities of recombination between target genes. The difficulty in computing recombination probabilities stems from the fact that the loci in $s_{k-1}$ may not all be adjacent to each other on the map, and that target locus i may not be adjacent to any locus in $s_{k-1}$. However, it is clear that over the entire pedigree there must be at least one recombination between each pair of adjacent loci. Hence, for any cascading pedigree $\mathcal{P}$, the probability of obtaining the ideotype is the product of a constant factor $\lambda_n$, which depends on the genetic map (recombination fractions between adjacent loci) but not on the pedigree, and a variable factor $v(\mathcal{P})$, which depends both on the genetic map and on the pedigree (i.e. the order of crossing parents):

$$f(\mathcal{P}) = \lambda_n \times v(\mathcal{P}) \tag{2}$$

with

$$\lambda_n = \left(\frac{1}{2}\right)^{n-1} \prod_{i=1}^{n-1} r_{i,i+1} \tag{3}$$

where $r_{i,i+1}$ is the recombination fraction between adjacent loci i and i + 1.
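The constant factor of equation (3) is straightforward to evaluate numerically. The sketch below is only an illustration: the function name `lambda_n` and the recombination fractions are assumptions made for the example, not values from the text.

```python
from math import prod


def lambda_n(r_adjacent):
    """Constant factor of equation (3).

    r_adjacent[i] is the recombination fraction between the i-th and
    (i+1)-th target loci on the map, so n = len(r_adjacent) + 1.
    """
    if any(not 0.0 <= r <= 0.5 for r in r_adjacent):
        raise ValueError("recombination fractions must lie in [0, 0.5]")
    n = len(r_adjacent) + 1
    return 0.5 ** (n - 1) * prod(r_adjacent)


# Hypothetical map with four target loci (three adjacent intervals).
print(lambda_n([0.1, 0.2, 0.3]))
```

Because this factor is the same for every cascading pedigree on a given map, only $v(\mathcal{P})$ matters when ranking pedigrees.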

Combining genes at adjacent loci

We first compare $v(\mathcal{P})$ in the case where the additional target gene i brought by the founding parent $P_i$ at each step k + 1 is at a locus adjacent to the loci already cumulated in $s_k$, i.e. all loci in $s_k$ are adjacent to each other, and locus i is either before the first locus of $s_k$ or after the last one. Note that the pedigree does not necessarily start with either the first or the last locus of the map, so there are (n − 1) possible choices for the first pair of adjacent loci. At each step we want to keep the loci already cumulated in coupling phase, that is, we do not want any recombination between any pair of already cumulated adjacent loci. The probability of this event is the product of $(1 - r_{l,l+1})$ over all pairs of adjacent loci l and l + 1 included in $s_k$. Hence, we can derive a general formula for $v(\mathcal{P})$ when combining genes at adjacent loci:

$$v(\mathcal{P}) = \prod_{k=1}^{n-2} \left(1 - r_{i_k,j_k}\right)^{n-1-k} \tag{4}$$

where $i_k$ and $j_k$ are the two adjacent loci cumulated at step k in the pedigree $\mathcal{P}$.
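Equation (4) can be evaluated for every pedigree that combines adjacent loci. In the sketch below, such a pedigree is represented, as an assumption made for this example, by the order in which the map intervals are cumulated; the function names and the recombination fractions are hypothetical.

```python
from itertools import permutations
from math import prod


def v_adjacent(r, interval_order):
    """Variable factor of equation (4) for a pedigree combining adjacent loci.

    r[i] is the recombination fraction of map interval i (0-based).
    interval_order[k-1] is the interval whose two loci are cumulated at step k.
    """
    n = len(r) + 1
    if sorted(interval_order) != list(range(n - 1)):
        raise ValueError("each interval must be cumulated exactly once")
    lo = hi = interval_order[0]
    for idx in interval_order[1:]:
        # A new interval must extend the cumulated block to the left or right.
        if idx == lo - 1:
            lo = idx
        elif idx == hi + 1:
            hi = idx
        else:
            raise ValueError("interval is not adjacent to the cumulated block")
    # The exponent n-1-k is zero at the last step, so that factor is 1.
    return prod((1 - r[idx]) ** (n - 1 - k)
                for k, idx in enumerate(interval_order, start=1))


def best_adjacent_order(r):
    """Brute-force search over all valid interval orders (small n only)."""
    best = None
    for order in permutations(range(len(r))):
        try:
            value = v_adjacent(r, order)
        except ValueError:
            continue
        if best is None or value > best[1]:
            best = (order, value)
    return best


print(best_adjacent_order([0.1, 0.2, 0.3]))
```

The brute-force search is exponential in the number of loci and is meant only to make the formula concrete.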

General case

In this part, we demonstrate that a pedigree cascading loci in the order of the genetic map gives the ideotype with a higher probability than any other cascading pedigree. Suppose three loci (i, j, k), located in that order on the genetic map (i.e. j is between i and k on the map). We consider sub-pedigrees combining these three loci. We compare the probability associated with the pedigree cumulating i and j in a first step and (i, j) and k in a second step, $\mathcal{P}(i, j, k)$, to that of the pedigree combining the same loci in a different order, $\mathcal{P}(i, k, j)$. Besides i, j and k, we assume that the sub-pedigrees $\mathcal{P}(i, j, k)$ and $\mathcal{P}(i, k, j)$ cumulate the other loci in the same way (i.e. all else being equal). The overall probability of giving the ideotype for the two pedigrees is then:

$$f(\mathcal{P}(i, j, k)) = \alpha \times \bar{\alpha}(\mathcal{P}(i, j, k)) \tag{5}$$

$$f(\mathcal{P}(i, k, j)) = \alpha \times \bar{\alpha}(\mathcal{P}(i, k, j)) \tag{6}$$

where $\alpha$ is the factor corresponding to the cumulation of the other loci besides i, j and k, and $\bar{\alpha}$ is the factor corresponding to the cumulation of loci i, j and k. Note that $\alpha$ is the same for the pedigrees $\mathcal{P}(i, j, k)$ and $\mathcal{P}(i, k, j)$, while $\bar{\alpha}$ is specific to each pedigree.

P(i, j, k) case

If we lay aside possible loci lying between i and j or between j and k, we can compute $\bar{\alpha}$ in the same way as for a pedigree combining adjacent loci (equation 4), with i and j cumulated at generation $t_1$ and k at generation $t_2$, so that:

$$f(\mathcal{P}(i, j, k)) = \alpha \times (1 - r_{i,j})^{n-1-t_1} (1 - r_{j,k})^{n-1-t_2}\, r_{i,j}\, r_{j,k} \tag{7}$$

P(i, k, j) case

In this case, we first cumulate loci i and k at generation $t_1$. In this step, we need a recombination between i and k, so that:

$$\bar{\alpha}(\mathcal{P}(i, k, j)) \propto r_{i,k} \tag{8}$$

If we cumulate locus j at generation $t_2$, we must make sure that no recombination occurred between i and k for $t_2 - t_1 - 1$ generations, so that:

$$\bar{\alpha}(\mathcal{P}(i, k, j)) \propto (1 - r_{i,k})^{t_2 - t_1 - 1} \tag{9}$$

At generation $t_2$ we need both a recombination between i and j and one between j and k. Finally, once loci i, j and k have been cumulated, we need to make sure that no recombination occurs in the two corresponding intervals until the end of the pedigree, so that:

$$\bar{\alpha}(\mathcal{P}(i, k, j)) \propto (1 - r_{i,j})^{n-1-t_2} (1 - r_{j,k})^{n-1-t_2}\, r_{i,j}\, r_{j,k} \tag{10}$$

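Finally, leaving aside the factor $r_{i,j}\, r_{j,k}$, which is common to both sub-pedigrees, we have computed their specific factors $\bar{\alpha}$:

$$\bar{\alpha}(\mathcal{P}(i, j, k)) = (1 - r_{i,j})^{n-1-t_1} (1 - r_{j,k})^{n-1-t_2} \tag{11}$$

$$\bar{\alpha}(\mathcal{P}(i, k, j)) = r_{i,k}\, (1 - r_{i,k})^{t_2 - t_1 - 1} (1 - r_{i,j})^{n-1-t_2} (1 - r_{j,k})^{n-1-t_2} \tag{12}$$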

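Dividing (11) by (12) one finds that:

$$\frac{\bar{\alpha}(\mathcal{P}(i, j, k))}{\bar{\alpha}(\mathcal{P}(i, k, j))} = \frac{(1 - r_{i,j})^{t_2 - t_1}}{(1 - r_{i,k})^{t_2 - t_1 - 1}\, r_{i,k}} = \left(\frac{1 - r_{i,j}}{1 - r_{i,k}}\right)^{t_2 - t_1 - 1} \times \frac{1 - r_{i,j}}{r_{i,k}} \tag{13}$$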

As k is more distant than j from i, we know that $r_{i,j} < r_{i,k}$, so that $(1 - r_{i,j}) > (1 - r_{i,k})$. Furthermore, as $0 \le r_{i,k} \le 0.5$, we have $1 - r_{i,j} > 0.5 \ge r_{i,k}$. The first factor of (13) is therefore at least one and the second is greater than one, so that

$$\frac{\bar{\alpha}(\mathcal{P}(i, j, k))}{\bar{\alpha}(\mathcal{P}(i, k, j))} > 1, \quad \text{i.e.} \quad \bar{\alpha}(\mathcal{P}(i, j, k)) > \bar{\alpha}(\mathcal{P}(i, k, j)) \tag{14}$$

From this, we can derive a rule for ranking pedigrees by their probability of giving the ideotype. If, in a given pedigree, two non-adjacent loci, say i and k, are cumulated at one step, a locus, say j, lying between i and k will be cumulated later on. In this case, we know how to find a better pedigree: the one that cumulates loci i, j and k in that order, the order of crosses for cumulating the other loci remaining unchanged. By applying this rule over the whole pedigree, we show that pedigrees combining adjacent loci are necessarily better than any other cascading pedigree.
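The inequality above can be checked numerically from the ratio in equation (13). The sketch below is a sanity check under the stated conditions ($r_{i,j} < r_{i,k} \le 0.5$ and $t_1 < t_2$), not a proof; the function name `alpha_ratio` and the grid of values are assumptions made for the example.

```python
def alpha_ratio(r_ij, r_ik, t1, t2):
    """Ratio of equation (13): map-order sub-pedigree over the alternative."""
    if not 0.0 <= r_ij < r_ik <= 0.5:
        raise ValueError("expected 0 <= r_ij < r_ik <= 0.5")
    if not t1 < t2:
        raise ValueError("expected t1 < t2")
    gap = t2 - t1
    return ((1 - r_ij) / (1 - r_ik)) ** (gap - 1) * (1 - r_ij) / r_ik


# Scan a hypothetical grid of recombination fractions and generation gaps.
grid = [x / 100 for x in range(1, 51)]
smallest = min(
    alpha_ratio(r_ij, r_ik, 1, 1 + gap)
    for r_ij in grid
    for r_ik in grid
    if r_ij < r_ik
    for gap in range(1, 6)
)
print(smallest > 1.0)
```

On this grid the ratio stays above one, in line with inequality (14).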

Part Five: Conclusion and Perspectives

The development of genotype construction methods is closely tied to the study of allelic segregation in multilocus, multigenerational systems. The problems considered therefore have a very large number of dimensions, and optimizing these systems quickly becomes very complex. In the search for an optimal solution for a genotype construction program, developing efficient computational tools is thus essential, because the complex problems posed cannot be solved "on paper". But these tools cannot be an end in themselves. First, the complexity of the problems is such that it is relatively easy to reach the limits of current (and even future) computing performance. Second, a computational tool ultimately only explores the space of possibilities "blindly": it cannot predict the best solution a priori, but only describe it after an exhaustive exploration of the possible solutions. Such tools do not in themselves make it possible to understand how multilocus systems work or to identify the general rules that apply to them. Only human analysis of the results they provide can do so. A better understanding of genotype construction schemes will make it possible to design more efficient computational tools, and thus to push back the limits of their field of study.

In the case of marker-assisted backcrossing, the optimizations that remain possible are likely to have only a marginal practical impact on the efficiency of real experimental schemes, for two reasons. First, existing methods are already efficient, which leaves little room for new theoretical developments. Second, in practice, experimental parameters become the limiting factors for the use of theoretical results (e.g. the number of available markers, the precision of genetic maps, the production of genotyping data, ...). On the other hand, it is important to disseminate the known optimization results to the plant breeding community. It is now possible to define specifications for the development of an efficient decision-support program. The perspectives on this subject therefore clearly lie in the field of computer science.

In contrast, the study of gene pyramiding schemes is open to many research perspectives in genetics. Here the complexity of the system is very high and we do not know its optimization rules, even for a small number of genes. This is a very interesting research problem which, beyond its practical value for designing real genotype construction programs, will certainly provide interesting results on the functioning of complex multilocus genetic systems, which are encountered in many areas of genetics. The study of the evolution of genetic frequencies at many linked loci in selection programs involves many factors of complexity. In existing theoretical studies, some of these factors are taken into account and others are generally ignored. New research questions in this field can be built around taking these factors into account:

Genetic linkage

The main complexity in the study of multilocus genetic systems (such as genotype construction programs) comes from the fact that the genes being manipulated are genetically linked. Indeed, when genes are unlinked, all recombination probabilities between genes are identical and the notion of gene order along the genome does not exist. When one is interested, as we are here, in genetic systems involving linked loci on chromosomes, this introduces additional complexity. The development of molecular markers has made it possible to build dense genetic maps that allow recombination events on chromosomes to be "visualized". This makes it possible to manipulate mapped genes in selection by selecting favorable recombinant gametes. However, optimizing selection in this setting is complex. First, the manipulated loci are ordered along the genome, so each locus must be considered specifically because the loci are no longer independent. For example, we do not know a priori the optimal order in which to cumulate genes. Second, because of genetic linkage, selection applied at one locus influences what happens in its neighborhood. This results, for example, in the introgression of large regions around target genes in classical backcrossing. Optimizing marker-based selection to reduce these segments is not trivial.

Optimization over several generations

Favorable recombination events in a genotype construction program have low probabilities because of genetic linkage between loci. In some cases the desired genotype can be obtained in a single generation, but in general the probability of obtaining the ideal genotype in one generation is very low, and methods spanning several generations must be designed. Backcrossing is an illustrative example: it is possible to obtain the ideotype at the markers in a single backcross generation (i.e. the probability of the event is non-zero). However, this probability is so low that several backcross generations are needed to obtain the ideotype. This may seem obvious, but it is important to note that it introduces an additional dimension into the system, because the succession of steps across generations must then be optimized. This is a fortiori true when the ideotype cannot be obtained in a single generation (i.e. the probability of the event is zero). For example, we showed in Servin et al. (submitted) that slightly increasing the duration of gene pyramiding programs, by modifying the crosses between individuals, drastically reduces the population sizes required.

Linkage phase: uncertainty as a principle

In backcrossing, the linkage phase of the alleles is perfectly known. Favorable recombination events are therefore easy to identify because, by observing the progeny of a cross, we directly observe the gametes of the parents. In the general case this is not always possible, and linkage phases must be estimated. A probability is then associated with each linkage phase, and the results obtained must be conditioned on the different possible phases. This is, for example, what the MDM program does when linkage phases are unknown in the progeny of a cross. Two solutions are then possible:
– consider each possible linkage phase with its associated probability. This implicitly accepts that the results found are wrong, since in practice only one of the possible phases is actually present. One then obtains an average result that has a good chance of not departing too far from reality.
– try to determine the most probable phase of the individual, and knowingly accept the risk of being wrong. It is then important to use as much information as possible to determine the gametic phase from the genotypes. The most promising methods seem to me to be the so-called MCMC (Markov chain Monte Carlo) methods, because they make it possible to use the maximum available information to predict unknown parameters.

Towards the multiallelic case: increasing combinatorics

Most theoretical studies assume a system in which two alleles are possible at each locus. Such situations are indeed fairly frequent, and it is often possible to reduce the problem to biallelic cases by grouping all favorable alleles at the genes of interest on one hand, and all unfavorable alleles at these genes on the other. However, when there is no prior knowledge of the values of the different alleles encountered (for example before QTL detection has been carried out), crossing designs in which many alleles actually segregate at each locus must be considered. This raises a definite problem because, for a given number of loci, increasing the number of alleles at each locus strongly increases the number of possible multilocus genotypes and hence the combinatorics of the space to explore. This is why, for example, the MDM program is very quickly limited in the number of possible alleles at each locus, and can only consider three loci simultaneously when four alleles are possible at each locus.

Interference and recombination

Classically, all calculations in theoretical studies of QTL detection or marker-assisted selection assume that crossovers occur without interference during meiosis. The results obtained are generally fairly dependent on this hypothesis. For most plant species, it does not appear to be a particularly strong assumption. It would nevertheless be interesting to study the influence of taking interference into account in the analysis of complex pedigrees, in order to quantify it. Indeed, it is difficult to predict a priori the impact of interference on the efficiency of marker-assisted selection. This impact depends not only on the interference relationship but also on the type of selection carried out. For example, one may expect short-range interference to make it easier to preserve the integrity of segments containing introgressed or cumulated QTL, so that this type of marker-assisted selection would be more efficient with interference than without. Conversely, long-range interference could penalize gene pyramiding by preventing certain favorable recombination events.

Appendices

IBD-based QTL detection in multi-cross inbred designs: A case study of cereal breeding programs

Sébastien Crepieux*,1, Bertrand Servin†, Claude Lebreton‡ and Gilles Charmet*

* UMR 1095 INRA-UBP, 234 Av. du Brezet, 63039 Clermont-Ferrand Cedex 2, France
† INRA UMR de Génétique Végétale, INRA/UPS/INAPG, 91 190 Gif sur Yvette, France
‡ Limagrain Agro-Industrie, site d’ULICE, av G. Gershwin, BP173, F-63204 Riom Cedex, France

IBD-based multi-cross QTL mapping

1 Corresponding author: UMR 1095 INRA-UBP, 234 Av. du Brezet, 63039 Clermont-Ferrand Cedex 2, France. Email: [email protected] Phone: 00 33 4 73 62 43 09, Fax: 00 33 4 73 62 44 53

Key words: QTL detection, variance component, pedigree breeding, IBD, multi-cross

ABSTRACT

Mapping quantitative trait loci in plants is usually conducted using a population derived from a cross between two inbred lines. The power of such QTL detection and the parameter estimates depend heavily on the choice of the two parental lines. Thus, the QTL detected in such populations represent only a small part of the genetic architecture of the trait. Besides, the effects of only two alleles are characterised, which is of limited interest to the breeder. On the other hand, common pedigree breeding material remains unexploited for QTL mapping. In this study, we extend QTL mapping methodology to a generalized framework, based on a two-step IBD variance component approach, applicable to any type of breeding population coming from inbred parents. The power and accuracy of this method were assessed on simulated data mimicking conventional breeding programs in cereals. This method can provide an alternative to the development of specifically designed recombinant populations, by exploiting the genetic variation actually managed by plant breeders. The use of these detected QTL in assisting breeding would thus be facilitated.

INTRODUCTION

The availability of molecular markers in the 1980s opened new scope for quantitative genetics and breeding. It was anticipated that loci underlying quantitative traits (QTL) would become as easy to manipulate as Mendelian factors. This prospect, however, has remained largely unfulfilled, despite the large body of theoretical studies on marker-assisted selection (e.g. LANDE and THOMPSON 1990; GIMELFARB and LANDE 1994, 1995; HOSPITAL et al. 1997). The main reason is probably the cost of markers and the relatively low improvement in selection efficiency, which generally make MAS much more expensive than conventional breeding (MOREAU et al. 2000). The other reason is that applied breeding programs and QTL research are often disconnected, i.e. carried out by different teams and using different plant material. Classically, QTL analyses are carried out on a few progenies from broad-based crosses, coming from a small number of distantly related lines, often including wild relatives. Such analyses mostly involve bi-parental progenies such as back-crosses (BC), doubled haploid lines (DH), F2 or recombinant inbred lines (RILs). In the approaches based on this kind of plant material, the effect of an allele substitution at a candidate locus is tested. This is called the fixed model approach (XU and ATCHLEY 1995) since it considers a fixed number of distinct alleles (most often two) at each putative QTL. Statistical methods for the QTL analysis of bi-parental populations underwent successive improvements through the advent of Interval Mapping (LANDER and BOTSTEIN 1989) and its linearization (HALEY and KNOTT 1992), Composite Interval Mapping (ZENG 1993, 1994; JANSEN 1993) and multiple trait QTL mapping (KOROL et al. 1995; JIANG and ZENG). On the other hand, the breeder’s material is far from the studied bi-parental populations.
Breeders generally handle many small families from crosses between (often) highly related elite lines. Thus, the methods described above are poorly adapted. Moreover there are many
drawbacks for the breeder’s use of the QTL found on bi-parental populations. First, when only two parents are considered, some markers and potential QTLs are more likely to be monomorphic, even if parental lines are carefully selected for trait divergence. Since, by definition, QTL can only be found at polymorphic sites in the genome, the expected number of QTL detected with a bi-parental cross will be lower than that expected when analyzing several crosses at a time (assuming the total number of genotypes is not the limiting factor). The second drawback is that the QTL effect is estimated as a contrast between two alleles and in one genetic background only. Therefore, in that context, the improvement of a line by the introgression of a QTL allele in a completely new genetic background is rather unpredictable, because of possible epistatic interaction between QTL and genetic background. Finally, from an economic standpoint, the cost of creation of large single cross progenies and specific trials for trait evaluation to perform QTL detection is quite high and often at the expense of other selection programs. All these drawbacks reduce the breeders’ interest for implementing such experimental designs when funding and work are constrained. Bi-parental crosses are usually preferred for more upstream studies, e.g. genomics: the fine-mapping of a QTL, which is a pre-requisite for its positional cloning, is easier when fewer QTLs are segregating. In contrast, the breeders’ focus will be to characterize the effect of a wide range of alleles in his germplasm. Methods for simultaneous detection and manipulation of QTL in breeding programs would thus enhance the applicability of MAS. In plant breeding, new methods for QTL detection in complex designs, close to those used in real breeding schemes have already been developed. 
MURANTY (1996) suggested working with progenies from several parents in plants, in order to achieve a high probability of having more than one allele at a putative QTL, and also to obtain a more representative estimate of the variance accounted for by a QTL. She operated in a fixed-effect framework. Simulations
demonstrated that a higher QTL detection power was achieved, for a given sample size. XU (1998) compared the QTL detection powers obtained with random effect models and fixed effects and found similar values for individual family sizes as low as 25 individuals. However, in more unbalanced designs, the random effect approach was presumed to be more suited as it can handle any arbitrary pedigree of individuals (LYNCH and WALSH 1998, XU 1998). Efficient methodologies for more fragmented populations in plants have been developed (XIE et al. 1998; YI and XU 2001, BINK et al. 2002, JANSEN et al. 2003 for example), but their extension or implementation for any complex plant designs, implying a mixing of half-sibs and full-sibs families of different sizes, at any generation of selfing and with the hermaphrodite status of parents, is not straightforward. The identical-by-descent (IBD)-based variance component analysis is a powerful statistical method for QTL mapping in complex populations and can be used in pedigrees of arbitrary size and complexity (ALMASY and BLANGERO 1998). These IBD-based variance component analyses are derived from the assumption that individuals of similar phenotype are more likely to share alleles that are identical by descent. The construction of IBD matrices for alleles at each tested position along the genome, and the fitting of random effect models (which assumes that QTL effects are normally distributed) offer an appropriate method to map QTL if the progeny population is large enough and if the progenies are connected in some way. Besides, these models do not need to assume a known, finite set of alleles at each putative QTL. Thus, they offer a less parameterized statistical environment in which to map QTL, because only the variances need to be estimated instead of every allele substitution effect. Generally, the IBD-based approaches assume a between family IBD-likelihood of zero (i.e. 
no parents in common between the two families), and thus consider the parents as founders. However, this assumption is often wrong in common breeding pedigrees. Furthermore, in fragmented situations, i.e. where there are many families of small sizes (especially when the genotyping
takes place at a late stage in pedigree breeding, where we may easily end up with as few as one or two lines per cross), the IBD-likelihood matrix can be very sparse. Hence, there could be much to be gained in exploring the actual between family IBD-likelihoods.

In the work reported in this paper, we built on these developments and further assumed a nonzero IBD-likelihood between non-sib lines. We then present a unified IBD-based variance component analysis framework to map QTL in any kind of multi-cross design involving self-pollinating species, at any generation. To test the accuracy of the method, we developed a simulation program which mimics the steps of real breeding schemes. We chose the simulation parameters according to the information provided by breeders on their real material, in order to be as comprehensive as possible in the range of genetic configurations explored. This method can provide an alternative to the development of specifically designed recombinant populations, by exploiting the genetic variation actually managed by plant breeders. Our method can thus provide breeders with valuable information about the breeding values of their material and help them to design selection strategies.

METHODS

Two-step IBD based variance component method

The method used to map QTL in a complex inbred pedigree is a two-step variance component method, as described in GEORGE et al. (2000). It first constructs the (co)variance matrix of fixed and random effects at each putative QTL position, and then estimates the likelihood of the presence of a QTL at these positions using appropriate linear models.
These two steps are common to all interval mapping based variance components methods. What differs mainly among all the published methods is the way to calculate the IBD probabilities (see GEORGE et al. 2000 for a review of IBD probabilities calculation). We adopted a deterministic approach to infer IBD probabilities for any generation of recombination and breeding scheme (e.g. F2, Fn, RIL, BCn), based on the MDM program (SERVIN et al. 2002).

Mixed linear models

We assume that the quantitative trait is a linear combination of fixed design effects, a putative QTL effect (with additive and/or dominance effect) and additive polygenic effects. The random polygenic effect is seen as the cumulative effect of all loci affecting the quantitative trait that are unlinked to the QTL. The model is:

$$y = X\beta + Zu + Zv + e \tag{1}$$

where y is an (m × 1) vector of phenotypes, X is an (m × s) design matrix, β is an (s × 1) vector of fixed effects, Z is an (m × q) incidence matrix relating records to individuals, u is a (q × 1) vector of additive QTL effects, v is a (q × 1) vector of additive polygenic effects and e is the residual. We assume that the random effects u, v and e are uncorrelated and distributed as multivariate normal densities: $u \sim N(0, G\sigma_u^2)$; $v \sim N(0, A\sigma_v^2)$; $e \sim N(0, I\sigma_e^2)$, with $\sigma_u^2$, $\sigma_v^2$ and $\sigma_e^2$ being respectively the additive variance of the QTL, the polygenic variance and the residual variance. A is the (q × q) additive genetic relationship matrix; G is the (q × q) (co)variance matrix for the QTL additive effects conditional on marker information; and I is the (m × m) identity matrix.

The model without a QTL segregating in the population is, with the same notation:

$$y = X\beta + Zv + e \tag{2}$$
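Because u, v and e are assumed uncorrelated, model (1) implies the phenotypic covariance $\mathrm{Var}(y) = ZGZ^T\sigma_u^2 + ZAZ^T\sigma_v^2 + I\sigma_e^2$, and model (2) is the special case $\sigma_u^2 = 0$. The sketch below only builds this matrix with plain Python lists; the helper names and all numerical values are hypothetical, and it is not the estimation procedure used in the paper.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def phenotypic_covariance(Z, G, A, var_u, var_v, var_e):
    """Var(y) = Z G Z' var_u + Z A Z' var_v + I var_e under model (1)."""
    Zt = transpose(Z)
    zgz = matmul(matmul(Z, G), Zt)
    zaz = matmul(matmul(Z, A), Zt)
    m = len(Z)
    return [[var_u * zgz[i][j] + var_v * zaz[i][j] + (var_e if i == j else 0.0)
             for j in range(m)]
            for i in range(m)]


# Two individuals with one record each (Z is the identity); toy matrices.
Z = [[1, 0], [0, 1]]
G = [[2, 1], [1, 2]]
A = [[2, 0.5], [0.5, 2]]
V_qtl = phenotypic_covariance(Z, G, A, var_u=1.0, var_v=2.0, var_e=3.0)
V_null = phenotypic_covariance(Z, G, A, var_u=0.0, var_v=2.0, var_e=3.0)
print(V_qtl, V_null)
```

Comparing the likelihoods obtained with these two covariance structures is what distinguishes a position with a QTL from one without.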

Computation of the IBD probabilities

We consider a mapping population composed of several sub-populations of small size. Each of these sub-populations is the offspring of an inbred pedigree started with two parents. For example, these sub-populations could be produced by several consecutive selfings (e.g. RILs) or back-crossings. We want to compute the probability that two individuals taken from any of these sub-populations share IBD alleles at a given locus of their genome. If we consider a pair of individuals from the mapping population, they may be (i) taken from the same sub-population, in which case they are full-sibs, or (ii) taken from two different sub-populations. In the latter case, if one of the parents is common to the two sub-populations, the two individuals are half-sibs; if the parents of the two sub-populations are distinct, the two individuals are considered unrelated. We now derive the IBD probabilities for each of these cases.

IBD value between two sibs at a QTL: Within each full-sib family of the breeding scheme, only two alleles are segregating, giving only three possible genotypes at the QTL: QQ, Qq and qq. Following the notation of XIE et al. (1998), the IBD value of two individuals i and j is measured as:

$$\pi_{i,j} = 2\theta_{i,j} = \begin{cases} 2 & \text{for } QQ-QQ \text{ or } qq-qq \\ 1 & \text{for } QQ-Qq,\ qq-Qq \text{ or } Qq-Qq \\ 0 & \text{for } QQ-qq \end{cases}$$

where $\pi_{i,j}$ are the ijth elements of G, and $\theta_{i,j}$ is MALECOT’s (1948) coefficient of coancestry. As pointed out by many authors, when inbreeding is present, $\pi_{i,j}$ is not interpreted as the proportion of alleles IBD, but rather as twice the coefficient of coancestry (KEMPTHORNE 1955; HARRIS 1964; COCKERHAM 1983).

Inferring the IBD value of a QTL from markers: The IBD value is completely determined by the genotypes of two individuals at the QTL of interest. The actual QTL genotype of an individual, however, is not observable, and must be inferred from flanking marker information $I_M$. For individual j, we denote the following probabilities: $p_{j2} = \Pr(QQ \mid I_M)$, $p_{j1} = \Pr(Qq \mid I_M)$, and $p_{j0} = \Pr(qq \mid I_M)$. We write $p_i = [p_{i2}\ p_{i1}\ p_{i0}]^T$ and $p_j = [p_{j2}\ p_{j1}\ p_{j0}]^T$. The conditional expectations of the IBD values between two full sibs are:

$$\hat{\pi}_{i,j} = E(\pi_{i,j} \mid I_M) = p_i^T C\, p_j$$

between individuals, and

$$\hat{\pi}_{j,j} = E(\pi_{j,j} \mid I_M) = c^T p_j$$

for the individual with itself, where c is the diagonal of C:

$$C = \begin{bmatrix} 2 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 1 & 2 \end{bmatrix} \quad \text{and} \quad c = \begin{bmatrix} 2 \\ 1 \\ 2 \end{bmatrix}$$

In the rest of the paper, this formula for computing IBD probabilities will be referred to as formula (1).

General case: Above, we emphasized the calculation of IBD for the full-sib case. The generalization to the half-sib case is trivial. Using this first formula to compute IBD probabilities, we assumed that the parents of the sub-populations were unrelated, i.e. they did not share any common ancestors. However, in practice, these parents came from previous generations of breeding and were very likely to share IBD alleles in their genome (due to the intensive use of some “star varieties”, for example). In order to take these possible relationships between parents into account, we estimated coancestry coefficients between them, using molecular information, as suggested by BERNARDO (1993). The pedigree information, if available, could be used to that end. However, selection pressure may cause the proportion of parental genome actually shared by the current lines to depart from the pedigree-based prediction. Besides, we assumed that the pedigree information was often very scarce or even unavailable. Hence, we resorted to a genetic similarity based index to estimate that proportion of genome.

First, we generalized the C matrix to known half-sib, full-sib and unrelated individuals by introducing the coancestries between parents, estimated by markers in our case. We considered two individuals taken from two sub-populations. Let P1 and P2 be the parents of the first sub-population and P3 and P4 the parents of the other sub-population. We denote by $GS_{P1P3}$, $GS_{P1P4}$, $GS_{P2P3}$ and $GS_{P2P4}$ the estimates of the coefficients of coancestry between these parents. Taking into account possible coancestries between P1 and P3 on one hand and P2 and P4 on the other hand, the C matrix can then be rewritten as:

$$C_1 = \begin{bmatrix} 2\,GS_{P1P3} & GS_{P1P3} & 0 \\ GS_{P1P3} & \tfrac{1}{2}\,(GS_{P1P3} + GS_{P2P4}) & GS_{P2P4} \\ 0 & GS_{P2P4} & 2\,GS_{P2P4} \end{bmatrix}$$

Note that in the full-sib case P1=P3 and P2=P4, so that GSP1P3 and GSP2P4 are equal to one and the C1 matrix is similar to C. Similarly, the relevant C matrices for half-sibs individual or for unrelated individuals can be obtained by replacing respectively GSP1P3 by one and GSP2P4 by zero on one hand, and both GSP1P3 and GSP2P4 by zero on the other hand. Similarly, taking into account the coefficients between the parents P1 and P4 on one hand and P2 and P3 on the other hand, we can re-write the C matrix as : GS P 2 P 3 2GS P 2 P 3   0  C 2 =  GS P1P 4 1 2(GS P 2 P 3 + GS P1P 4 ) GS P 2 P 3  2GS P1P 4 GS P1P 4 0 

Finally, we can draw a general formula for the conditional expectation of the IBD values between two individuals coming from four (distinct or not) inbred parents:

\[ \hat{\pi}_{i,j} = E(\pi_{i,j} \mid I_M) = p_i^T C_1 p_j + p_i^T C_2 p_j \]

i.e.

\[
\begin{aligned}
\hat{\pi}_{i,j} = E(\pi_{i,j} \mid I_M) = 2\big(\, & GS_{P1P3}\,(p_{j2} + \tfrac{1}{2} p_{j1})(p_{i2} + \tfrac{1}{2} p_{i1}) \\
+\, & GS_{P1P4}\,(p_{j2} + \tfrac{1}{2} p_{j1})(p_{i0} + \tfrac{1}{2} p_{i1}) \\
+\, & GS_{P2P3}\,(p_{j0} + \tfrac{1}{2} p_{j1})(p_{i2} + \tfrac{1}{2} p_{i1}) \\
+\, & GS_{P2P4}\,(p_{j0} + \tfrac{1}{2} p_{j1})(p_{i0} + \tfrac{1}{2} p_{i1}) \,\big)
\end{aligned}
\]
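The matrix form and the expanded form of the general formula can be checked against each other numerically. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, genotype probability vectors are assumed to be ordered (QQ, Qq, qq), and the four genetic similarities are passed in as plain numbers.

```python
def c1_matrix(gs13, gs24):
    """C1 matrix built from GS_P1P3 and GS_P2P4 (rows: p_i, columns: p_j)."""
    return [[2 * gs13, gs13, 0.0],
            [gs13, 0.5 * (gs13 + gs24), gs24],
            [0.0, gs24, 2 * gs24]]


def c2_matrix(gs14, gs23):
    """C2 matrix built from GS_P1P4 and GS_P2P3 (rows: p_i, columns: p_j)."""
    return [[0.0, gs23, 2 * gs23],
            [gs14, 0.5 * (gs23 + gs14), gs23],
            [2 * gs14, gs14, 0.0]]


def bilinear(p_i, matrix, p_j):
    """Compute p_i^T M p_j for 3-element probability vectors."""
    return sum(p_i[r] * matrix[r][c] * p_j[c]
               for r in range(3) for c in range(3))


def expected_ibd(p_i, p_j, gs13, gs14, gs23, gs24):
    """Matrix form: p_i^T C1 p_j + p_i^T C2 p_j."""
    return (bilinear(p_i, c1_matrix(gs13, gs24), p_j)
            + bilinear(p_i, c2_matrix(gs14, gs23), p_j))


def expected_ibd_expanded(p_i, p_j, gs13, gs14, gs23, gs24):
    """Expanded form, written term by term as in the text."""
    i_big = p_i[0] + 0.5 * p_i[1]    # p_i2 + 1/2 p_i1
    i_small = p_i[2] + 0.5 * p_i[1]  # p_i0 + 1/2 p_i1
    j_big = p_j[0] + 0.5 * p_j[1]    # p_j2 + 1/2 p_j1
    j_small = p_j[2] + 0.5 * p_j[1]  # p_j0 + 1/2 p_j1
    return 2 * (gs13 * j_big * i_big + gs14 * j_big * i_small
                + gs23 * j_small * i_big + gs24 * j_small * i_small)
```

With `gs13 = gs24 = 1` and `gs14 = gs23 = 0` (full sibs from unrelated parents), `expected_ibd` reduces to the full-sib expression $p_i^T C p_j$.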

The conditional expectation of the IBD for an individual with itself remains:

\[ \hat{\pi}_{i,j} = E(\pi_{i,j} \mid I_M) = 2 p_{j2} + p_{j1} \]

In the rest of the paper, this formula, using the $C_1$ and $C_2$ matrices, will be referred to as formula (2).

The elements of the additive relationship matrix A are estimates of the proportion of IBD genome between any two lines: element $a_{i,j}$ is the proportion of IBD genome between line i and line j, based on genetic similarities. For both the A and G matrices, genetic similarities ($GS_{i,j}$) were computed using the NEI and LI (1979) formula.

Implementation of the IBD formula
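As an illustration of the genetic similarity index cited above, here is a minimal sketch of the NEI and LI (1979) coefficient, 2 n_ab / (n_a + n_b). The data layout is an assumption made for this example only: each fully homozygous line is a dictionary mapping a marker name to one allele code, `None` marks a missing genotype, and each (marker, allele) pair is counted as one "band". The function name is hypothetical.

```python
def nei_li_similarity(line_a, line_b):
    """Nei and Li similarity between two homozygous lines.

    Assumed layout: dict of marker name -> allele code, None if missing.
    Returns 2 * (shared bands) / (bands in a + bands in b).
    """
    bands_a = {(m, a) for m, a in line_a.items() if a is not None}
    bands_b = {(m, a) for m, a in line_b.items() if a is not None}
    total = len(bands_a) + len(bands_b)
    if total == 0:
        raise ValueError("no scored markers in either line")
    return 2 * len(bands_a & bands_b) / total
```

When both lines are scored at the same markers, the coefficient is simply the fraction of markers at which they carry the same allele.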

We used the deterministic approach of the MDM program (SERVIN et al. 2002) to compute all the $p_i$ and $p_j$ probabilities. IBD likelihoods were computed every 3 cM. Two flanking markers were used to infer the genotype probabilities. In the frequent case where the two parents shared the same marker alleles at one or both of the loci flanking the putative QTL position, the next closest markers to the interval were used. It can easily be demonstrated that the IBD probabilities calculated at a putative QTL will be more precise if the flanking markers are highly polymorphic. Issues regarding the use of better estimates of the coancestry coefficients will be discussed in the last part of the article. All the G matrices were then inverted and written in the ASREML (GILMOUR et al. 1998) format for user-defined inverse (co)variance matrices. We also computed the additive relationship matrix A, which was then inverted and written in ASREML format (end of step 1). In step 2, ASREML provided restricted maximum-likelihood (REML) estimates of (1) and (2). To test for the presence of a QTL against no QTL at a particular chromosomal position, we used the log-likelihood ratio test:

\[ LR = -2\,\big[\ln L_0(H_0,\ \text{no QTL present}) - \ln L_1(H_1,\ \text{QTL present})\big] \]

where L1 and L0 represent the likelihood values of (1) and (2) evaluated at the REML solutions, respectively.

Test statistic under the null hypothesis

The choice of the threshold in this kind of population is always challenging. Many publications (ZENG 1994; XU and ATCHLEY 1995, for example) report that when a chromosomal interval is being scanned, the empirical distribution of LR follows a mixture of two chi-square distributions, with one and two degrees of freedom, respectively. Since this article deals with simulated data, it is possible to replicate data under the null hypothesis of no QTL segregating, construct the empirical distribution of LR and derive an empirical threshold by choosing the 95th percentile of the highest test statistic, generally over 500 or 1000 stochastic realizations. In this paper, we calculated an empirical threshold for every set of parameters to see whether the threshold differed significantly between settings. Hence, for each set of parameters tested, we ran 1000 additional simulations with no QTL segregating, increasing the polygenic variance such that the total genetic variance remained unchanged. We then determined the empirical threshold as the 95th percentile of the highest LR values obtained in these 1000 runs.
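The threshold procedure described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the code used in the study: the function names are hypothetical, each null replicate is assumed to be summarised by its highest LR along the scanned chromosome, and the percentile is taken with a simple nearest-rank rule (other percentile conventions would give slightly different values).

```python
import math


def likelihood_ratio(loglik_null, loglik_qtl):
    """LR = -2 * (ln L0 - ln L1), from the two maximised log-likelihoods."""
    return -2.0 * (loglik_null - loglik_qtl)


def empirical_threshold(max_lr_per_null_replicate, alpha=0.05):
    """Nearest-rank (1 - alpha) percentile of the per-replicate maximum LR."""
    ordered = sorted(max_lr_per_null_replicate)
    if not ordered:
        raise ValueError("at least one null replicate is required")
    # Small epsilon guards against floating-point noise in (1 - alpha) * n.
    rank = math.ceil((1.0 - alpha) * len(ordered) - 1e-9)
    rank = min(max(rank, 1), len(ordered))
    return ordered[rank - 1]
```

A replicate simulated with a QTL would then be declared significant when its highest LR along the chromosome exceeds the value returned by `empirical_threshold` for the matching parameter set.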

A SIMULATION STUDY: The case of a cereal breeding program

We chose the cereal example in the simulation study as it contains most of the difficulties generally encountered in inbred breeding schemes:

- the frequent unavailability of reliable pedigree information beyond the parents (and thus the unavailability of ancestor lines for genotyping)

- the potential to genotype only advanced generations of selfing, when the number of lines has been narrowed down and the precision of trials has increased, which constrains us to compute IBD at the F6 or F7 stage without any marker information between the initial cross and the resulting progenies

- the very high number of parents of the mapping population, yielding very small full-sib families, and an uneven (L-shaped) distribution of half-sib family sizes

Simulation of the inbred breeding scheme

Every year, new plant breeding programs are started. The choice of the numerous parents for crosses combines, on one hand, lines with the highest genetic merit for some traits of interest such as yield or quality and, on the other hand, lines of specific interest such as special quality or pest/disease resistance, sometimes taken from old or exotic material. The original crossing scheme in a breeding program depends heavily on the breeder for the choice of the parents, but the breeding process is often much the same. To reproduce the steps of the breeding programs, an S-PLUS (2000) function was developed. This function was designed by analysing information from wheat breeders together with the allele diversity and frequencies made available by recent studies (DONINI et al. 2000; RODER et al. 2002). All simulations follow a similar procedure (Figure 1):

Figure 1 around here

Marker-genotypes construction: Before the start of the breeding process, we considered only a few founder lines, with full linkage disequilibrium across their whole genome, hence between all the markers and the QTLs. We also imposed that, at this generation, no founder line had any allele in common with any other. Thus, line 1 carried only markers and QTLs coded 1, line 2 only markers and QTLs coded 2, and the last founder line (the NPth) carried only markers and QTLs coded NP all along its genome. The genome of simulated individuals was composed of 21 chromosomes of 100 cM, with markers evenly spaced every d centiMorgans all along the chromosomes, including one marker at each of the two telomeres of every chromosome. On chromosome 1, we simulated a single QTL between two markers. In addition to the QTL of interest, we simulated on the other chromosomes Npoly=40 randomly located QTLs (that could therefore be linked or not) with random effects to simulate the polygenic contribution to the trait values.

First generation of crosses: From the founder line generation, circular crosses, i.e. 1*2, 2*3, … , (NP-1)*NP, NP*1, were performed. We then derived lines by self-pollination during five generations in order to obtain F6 lines. NP mixed sub-populations of the same size (same contribution of all founders for this first generation) were derived, giving the “G0” generation. The population was considered as totally fixed, by sampling only one gamete after the last self-pollination. This stage corresponded to the end of a breeding cycle. The recombination procedure was based on randomly placed chiasmata with no interference.

Quantitative trait: The parameters for the creation of the quantitative trait are the QTL heritability (h²QTL) and the heritability of the polygenes (h²poly). We created, at the founder line generation, the NP allele effects for the QTL and for the Npoly polygenes. The NP=20 possible effects of the QTL were drawn from a normal distribution with mean 0 and variance 1. Then the QTL variance (VarQTL) was calculated at the true QTL position, and the NP=20 effects for each of the Npoly polygenes were drawn from a normal distribution with mean 0 and variance [VarQTL*(h²poly/h²QTL)]/Npoly. Finally, the true variance accounted for by the polygenes (VarPoly) was computed, and random normally distributed noise with variance σ²e = VarQTL*(1/h²QTL − 1) − VarPoly was added to simulate phenotypic values of the trait.
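The variance bookkeeping of the quantitative trait can be made concrete with a small worked sketch. The function names are hypothetical, and the total phenotypic variance is assumed here to be the plain sum VarQTL + VarPoly + σ²e, i.e. any covariance between the QTL and the polygenes is ignored for the purpose of the illustration.

```python
def polygene_effect_variance(var_qtl, h2_qtl, h2_poly, n_poly):
    """Variance of the allele effects drawn for each polygene."""
    return var_qtl * (h2_poly / h2_qtl) / n_poly


def residual_variance(var_qtl, var_poly, h2_qtl):
    """sigma2_e = VarQTL * (1 / h2QTL - 1) - VarPoly."""
    sigma2_e = var_qtl * (1.0 / h2_qtl - 1.0) - var_poly
    if sigma2_e < 0:
        raise ValueError("realised polygenic variance too large for h2_qtl")
    return sigma2_e


def realised_heritabilities(var_qtl, var_poly, sigma2_e):
    """Return (h2 of the QTL, h2 of the polygenes) under the assumed sum."""
    total = var_qtl + var_poly + sigma2_e
    return var_qtl / total, var_poly / total
```

For example, with VarQTL = 0.9, a target h²QTL of 0.1 and a realised VarPoly of 3.2, the residual variance is 0.9*9 − 3.2 = 4.9 and the total is 9.0, so the QTL heritability is 0.1 by construction, while the polygenic heritability is 3.2/9 ≈ 0.356 whatever the target h²poly was.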
Thus, the ratio of the additive variance explained by the QTL to the total phenotypic variance is exactly equal to the specified value h²QTL, while the ratio of the polygenic variance to the total phenotypic variance may differ slightly from the specified h²poly. Hence, the allele effects and the environmental variance were created at the first generation and remained constant for all the generations, even if the number of alleles decreased. Nevertheless, the environmental variance was adjusted at the last generation to set the desired QTL heritability before performing the QTL detection.

Overlapping generations and matrix of crosses: Once the genotypes and phenotypes of the lines had been obtained, virtual breeding schemes were performed. Two hypotheses were implemented by extrapolating information obtained from breeders:

- The “overlapping” choice of the parents: the parents were not all necessarily extracted from the last generation only; a proportion of them (a parameter) could originate from older ones.

- The influence of a matrix of crosses on the structure of the resulting progeny of a breeding program, which strongly influenced the effective population size.

The design of crosses at the beginning of a breeding scheme could be seen as a geometric series, since the representation of parents in the selected progeny is uneven, L-shaped rather than random. For example, if a given line, say X, is the most cultivated line at a given period (with the best agronomic performance in a range of environments), X will usually be crossed to many other lines to fully exploit its genetic value. After self-pollination and selection, a certain number of lines coming from this parent X will still remain at the F6 stage, and will form one of the largest half-sib families of the whole breeding scheme (possibly containing some full-sibs when a specific cross is particularly outstanding). On the other hand, an exotic plant of narrowly focused interest but with low agronomic performance may also be used to initiate crosses, but on a smaller scale. Some of its offspring will probably also be selected, but to a much smaller extent.
A matrix of crosses was implemented to reproduce the formation of half-sib and full-sib families during a cycle of breeding, hence taking into account the unbalanced contributions of parents to the final population. Only the upper triangle of this matrix was filled, with the parents sorted from the best ones to the exotic ones. The choice and order of the parents used to fill the matrix were based on how recent they were: we considered that the best agronomic lines came from the most recent generations of breeding, and that the exotic ones came from older generations. The “overlapping” option extracted 80% of the parents from the most recent generation of breeding and 20% from all the older generations (accounting for 10% of the resulting progeny). On average, 285 crosses were performed per simulated selection cycle (out of 4750 possible crosses in the full matrix of crosses). Note that each cross gave, on average, 1.75 full-sibs and that each parent was found, on average, in ten progenies. Thus, each individual is related, on average, to 8.25 half-sibs and to 0.75 full-sib. The resulting half-sib family sizes are presented in Figure 2.

Figure 2 around here

Performing n breeding cycles: After the first generation, a loop over the number of breeding cycles was performed, and the parents of crosses, the resulting progenies and the phenotypic data were stored. All available generations were used to build the next. The last breeding cycle, NG, was used for QTL detection. Note that at the beginning all the allele frequencies were equal, which was no longer the case after many generations owing to genetic drift and non-panmictic conditions. NP alleles with different effects were possible at each QTL and at each marker. All the markers and QTL were in full linkage disequilibrium at G0, but no longer were after many recombinations (5 generations of self-pollination per cycle, and NG cycles before the mapping generation). Thus, it is easy to anticipate that the information carried by markers and QTL will differ in many cases, and that ANOVA-based methods are likely to be inefficient. As the goal of this article is not to assess the influence of selection on QTL detection, we did not perform recurrent pedigree selection on the value of the quantitative trait for the choice of the parents, or intra-cross selection during the self-pollination process. We just wanted to assess the influence of the structure of the population and the effect of the design of crosses and breeding cycles for a non-selected trait, at a non-selected locus.

Design of simulations:

Standard setting: as proposed by XU (1998), instead of performing simulations on a factorial design of every possible parameter combination, we chose a standard setting for each parameter. The conditions of this standard setting were 21 chromosomes of length 100 cM each, with 11 markers spaced every 10 cM, and a QTL segregating at position 45 cM on chromosome 1 with heritability 0.1. Npoly=40 polygenic QTL were randomly placed on the other chromosomes, setting a total genetic heritability (h²g) of 0.435 (with standard deviation 0.051 among the 100 repetitions). For each cycle, about 500 individuals were created at the F6 stage, coming from 100 parents. The studied generation came from the 10th cycle of breeding, having considered NP=20 founder lines at the beginning of selection.

Comparison of IBD formula (1) and IBD formula (2) for three levels of QTL heritability: Both formulae were tested on the same set of 100 independent replicates. Three different quantitative traits per replicate were created at the last generation of the breeding scheme, with QTL heritabilities of 0.05, 0.1 and 0.2. The total genetic heritability (h²g) was initially set to 0.5, but after 10 generations of crosses (and genetic drift) some stochastic differences between replicates appeared. Thus, h²g values are 0.427 (standard deviation 0.063 among 100 replicates) for h²QTL=0.05, 0.435 (0.051) for h²QTL=0.1, and 0.456 (0.047) for h²QTL=0.2.

Different experimental conditions: We first tested the robustness of formula (2) under different conditions of marker bias. To avoid stochastic differences between the tested conditions, the same 100 replicates from the standard setting were used before adding bias to the marker information. Thus, differences between the results within those conditions can be directly attributed to this added bias on marker information. Created biases included (1) missing marker information (NA) (10% of randomly placed NA in the progenies), and (2) a portion of non-IBD alike-in-state (AIS) alleles (a portion of allele codes were changed to other existing allele codes in the parent and progeny files to obtain 25% of AIS, i.e. not IBD, alleles in the whole genome). We then varied parameters around the standard setting, changing only one parameter at a time. These parameters include (1) the total number of founder lines in the base population (NP: 40 vs 20); (2) the number of breeding cycles (NG: 20 vs 10); (3) the marker density (d: 20 cM vs 10 cM); and (4) the number of lines at the mapping generation (n: 250 vs 500) and the number of parents (m: 50 vs 100). We also tested the 20 cM marker density condition (3) with 25% of AIS alleles in order to assess the effect of this added bias at a lower marker density. Formula (2) was used for the analysis of all these variations, except for the NG=20 case, where we tested both formulae. Some parameters remained constant for all the simulations: the total number of chromosomes (21), the number of polygenic QTL (40), the position of the QTL on chromosome 1 (45 cM, except for the d=20 cM case where the QTL position is 50 cM), and the total genetic heritability, originally set to 0.5.

RESULTS

We report below the results of the two-step variance component analysis for the different settings. Due to computing constraints, only every third centiMorgan was tested for the presence of a QTL. Under each condition, the detection was performed for 100 random replicates. Parameter estimates and their standard errors are reported. The empirical thresholds were computed from the analysis of 1000 replicated data sets. In the special case of the estimation of the QTL position, we measured a confidence interval (CI) based on the LOD drop-off (LDO) method (LANDER and BOTSTEIN 1989), calculated for each significant replicate. With this interval, we calculated the frequency at which the true QTL location was included in the LOD drop-off CI. We also measured the frequency at which the true QTL position was included in the CI determined by 4 times the standard error of the estimated position. Finally, for the significant replicates and at the true QTL position, we calculated the bias on the QTL heritability induced by our method, for both formulae. We then computed the QTL heritability over an interval including the true QTL position.

Figure 3 around here

The average likelihood ratio test profiles over 100 replicates for the different settings explored are presented in Figure 3. As expected, we notice a strong influence of the magnitude of the QTL effect (i.e. the heritability of the QTL) on the LR profile (Figure 3a). Formula (2), which takes into account ancestor pedigree relationships as estimated by markers to infer the IBD probabilities, clearly outperforms formula (1) in terms of detection power for the three levels of QTL heritability. For subsequent simulations, we used formula (2) only. The method seems robust to biased marker information (Figure 3b). Indeed, the likelihood ratio test profiles obtained under conditions expected in real breeding schemes (i.e. non-informative or wrongly informative alleles, biased map estimation) show only small differences between the tested conditions, except for the 10% missing marker information in the last generation, where the LR profile lies below the profile obtained with complete marker information. Alike-in-state alleles that come from different founder lines, and are hence linked to different, non-IBD QTL alleles, have only a small influence on the LR curve. Figures 3c and 3d show the variations around the standard setting for the number of founders, the number of breeding cycles, the total number of F6 lines and the marker density. For a constant mean size of the derived populations (10 F6 lines per parent), switching from 500 to 250 F6 lines generates a strong decrease in detection power (see Figure 3c, lower curve). We also notice in this figure a very low influence of the number of breeding cycles (NG=20) and a strong effect of marker informativeness, represented by the number of founder lines (NP=40 against 20). Finally, we notice a low influence of the marker density, except on the smoothness of the curve, which presents informativeness peaks close to the marker positions (Figure 3d). Nevertheless, with 25% of AIS alleles, the QTL detection power was found to be more sensitive to a decrease in marker density, from one marker every 20 cM downwards. With a marker every 10 cM, the decrease in detection power is not very large (see Figure 3b).

Table 1 around here

The ability of the method to accurately estimate the parameters of interest can be judged from the results presented in Table 1. The accuracy of the estimated QTL position increases with the switch from formula (1) to formula (2) and also, as expected, with higher QTL heritabilities. Nevertheless, formula (2) leads to a stronger overestimation of both the QTL and the total genetic heritabilities than formula (1) does. Among the different designs explored, the smaller population size is the parameter that most affects the precision of the QTL position estimate. Conversely, the higher number of founder lines at the beginning of crosses gives the best estimates, as it increases the chance of having polymorphic markers between the parents. The influence of the number of breeding generations on the accuracy of the parameter estimates also turned out to be small. Finally, we notice that the 20 cM density case yields accurate results even though the number of markers available to build the relationship matrix and infer the IBD is divided by two. Nevertheless, both heritability estimates are less accurate than with a 10 cM density map.


Table 2 around here

The empirical threshold values of LR test statistics over 1000 replicated simulations are reported in Table 2. For all the designs, the critical values are nearly equivalent. This is not really surprising as the number of parameters being tested in the random model strategy remains the same. We also report in Table 2 the average LR test statistics and the power estimates for a Type I error α=0.05 over 100 replicated simulations. First we notice that the value of the LR test statistic increased with the value of the QTL heritability and that the chance to detect QTL is increased by using formula (2). Biased marker information (non-IBD but AIS marker alleles, missing genotypes) does not really influence the detection, except for 10% of missing data where the loss in power reaches 14% in comparison to the standard setting. For the different experimental designs, a higher number of founder lines tends to increase the detection power. Finally, the simulations where d=20cM and NF6=250 give correct detection powers in comparison to the standard setting.

Table 3 around here

In Table 3, we report the size of the confidence interval determined by four standard deviations of the estimated position or by a drop of one or two LOD units. These CIs were established by taking only the significant runs into account. We also report the frequency at which the true position (i.e. 45 cM) is included in the delimited interval (percentage of inclusion). We notice that the two-LOD drop gives the confidence interval whose percentage of inclusion is closest to 95%. In most cases, this interval is larger than or equal to the one delimited by four standard deviations. Nevertheless, as the LOD drop-off method does not give symmetric intervals, it makes better use of the information in the curve. We also notice that the one-LOD drop gives appropriate confidence intervals only for the highest QTL heritability.
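To make the LOD drop-off interval concrete, the sketch below derives it from an LR profile evaluated on a grid of positions (every 3 cM in this study). It is a simplified illustration, not the procedure actually coded by the authors: the function names are hypothetical, the LR values are converted to LOD units with the usual relation LOD = LR / (2 ln 10), and the interval bounds are restricted to grid positions without interpolation.

```python
import math


def lr_to_lod(lr):
    """Convert a likelihood ratio statistic to LOD units."""
    return lr / (2.0 * math.log(10.0))


def lod_drop_interval(positions, lr_profile, drop=1.0):
    """Contiguous grid interval around the peak where LOD >= peak - drop.

    positions must be sorted and aligned with lr_profile.
    """
    lod = [lr_to_lod(value) for value in lr_profile]
    peak = max(range(len(lod)), key=lod.__getitem__)
    cutoff = lod[peak] - drop
    left = peak
    while left > 0 and lod[left - 1] >= cutoff:
        left -= 1
    right = peak
    while right < len(lod) - 1 and lod[right + 1] >= cutoff:
        right += 1
    return positions[left], positions[right]
```

Because the walk to the left and to the right of the peak is done independently, the resulting interval follows the shape of the curve and need not be symmetric around the estimated position.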

Table 4, 5, 6 around here

Table 4 shows, for the different levels of QTL heritability, the parameter estimates for the replicates that are significant under the empirical threshold. We notice that, in comparison to Table 1, which includes all the replicates, the overestimation of the QTL heritability is higher. However, the estimated position is more accurate when averaging over the significant replicates only, and also has a smaller standard deviation. In Table 5, we present the QTL and total genetic heritabilities at the true QTL position, to detect possible bias from the method under formulae (1) and (2). We notice that formula (2) gives more accurate estimates at the true QTL position than formula (1) for all levels of QTL heritability. As formula (2) gives almost unbiased estimates at the true QTL position, a corrective factor for the heritability could then be worked out as a function of the estimated position. We thus report in Table 6 an attempt to give a more appropriate estimate of the heritability for this kind of method. We built an “Averaged Confidence Interval Heritability” by averaging the heritability over the confidence interval around the position estimate (which is supposed to include the true QTL position), instead of taking only a point heritability estimate at the detected position. We notice that this corrective factor yields results very close to the true parameter value with formula (2) but underestimates both heritabilities with formula (1).
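The “Averaged Confidence Interval Heritability” can be illustrated by a short sketch. The text does not state how the average was weighted, so an unweighted arithmetic mean over the scanned grid positions falling inside the confidence interval is assumed here; the function name and argument layout are hypothetical.

```python
def averaged_ci_heritability(positions, h2_estimates, ci_left, ci_right):
    """Mean heritability estimate over grid positions inside [ci_left, ci_right]."""
    inside = [h2 for pos, h2 in zip(positions, h2_estimates)
              if ci_left <= pos <= ci_right]
    if not inside:
        raise ValueError("no scanned position falls inside the interval")
    return sum(inside) / len(inside)
```

For a profile with estimates 0.10, 0.14 and 0.12 at 42, 45 and 48 cM and a confidence interval of 42 to 48 cM, the averaged value is 0.12, lower than the point estimate of 0.14 at the peak.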


DISCUSSION

Obviously, many statistical methods already exist to map QTL in inbred plant material; however, these methods mainly focus on a single bi-parental cross. Other methods have been developed to address more challenging population structures (XIE et al. 1998; XU 1998; YI and XU 2001, for example). Nevertheless, these methods do not appear to be easily extended to highly fragmented populations, at any selfed or back-crossed generation, and coming from many different parents. They also do not take into account the possibility for alleles to be IBD if ancestor pedigrees are not available. In this study, we extended the QTL mapping methodology proposed by GEORGE et al. (2000), based on a two-step IBD variance component approach, to typical plant breeding populations made up of selfed lines which may have one or two parents in common, and whose parents may or may not be related to each other. In this paper we studied the simplest possible scenario, where no relevant information about the ancestors’ pedigrees was available. The power and accuracy of this method were assessed using simulated data mimicking conventional breeding programs in cereals, in an effort to reproduce actual conditions of marker and gene frequencies and linkage disequilibrium across the parental lines. In constructing the matrix of IBD probabilities, a more thorough use of the marker data was achieved by calculating the genetic similarities between the parents. The extent to which the matrix G was modified from formula (1) to formula (2) is quite large. The proportion of PIBD values that were equal to zero with formula (1) (those between non-sib lines) and were replaced by non-zero values with formula (2) was 91%, as calculated from the distribution of family sizes featured in Figure 2. The inferred relatedness patterns between non-sibs lead to a substantial improvement of the accuracy of the position estimates and of the QTL detection power.
Nevertheless, a downside of this improvement is a stronger overestimation of the QTL heritability as we shall discuss below.


The method performed quite well for all the tested sources of bias (missing marker data, non-IBD alike-in-state alleles). Two complementary explanations can be put forward for the loss in statistical power observed with the lower marker density in the settings with non-IBD AIS alleles:

1- The lower chance of having informative markers flanking the interval being scanned. Informative flanking markers had to be fetched further apart on average, which in turn decreased the accuracy of the estimates of the putative QTL’s allelic state.

2- A higher proportion of alleles that are AIS but not IBD should also have generated an upward bias in some or most of the genetic similarity estimates between the parents. This, in turn, will have affected the estimates of IBD probabilities and, concomitantly, the additive relationships between individuals, thereby generating an upward bias of their estimates for non-full-sibs.

In our population design, there is a strong within-family linkage disequilibrium that can be exploited by comparing the parents’ genotypes to the current F6 lines, which have accumulated relatively few cross-overs. Formula (1) is solely based on the utilization of this linkage disequilibrium, and is similar to that used by XIE et al. (1998). Formula (2) can be viewed, loosely speaking, as an attempt to merge, to some extent, several families together on the basis of the likelihood that the parents share the same alleles identical-by-descent at the putative locus. The power increase obtained by using formula (2) follows the same principle as that obtained by XIE et al. (1998) in their Table 7 when they switched from a 100 x 5 sampling strategy to a less fragmented 50 x 10.

The same comparison can be made to explain the increased detection power observed after 20 breeding cycles instead of 10. Due to the very uneven parental contribution to the crossing scheme at each breeding cycle, some genetic drift takes place regularly during the breeding cycles, leading to the loss of certain haplotypes. Cross-overs increasingly occur between chromosome blocks with similar haplotypes and are thus genetically ineffective. This could be compared to a situation where a smaller number of “effective” parents were used, which, for a constant progeny size, gives enhanced power (XIE et al. 1998). This explanation is confirmed by the comparison between the use of formula (1) and formula (2) on the same 100 replicates for the NG=20 setting. Formula (2) captures the similarities between the parents in a better way, hence yielding better detection power.

QTL heritability readjustment before mapping

In preliminary simulations, QTL heritabilities changed from the founder line generation to that of the mapping F6 lines, due to random genetic drift and non-panmictic conditions in small populations, hence yielding different heritabilities at the last generation. What we are concerned with is the QTL heritability at the current mapping generation, to allow sounder comparison with published results. It therefore seemed sensible to choose that one, as opposed to that at the founder generation, as an entry parameter for our study. It is also to reconcile a germplasm history effect with a resulting QTL heritability set as a parameter that we systematically readjusted this parameter at the end of the ten breeding cycles. In so doing, we did not compromise too much the assessment of the germplasm history effect as far as allelic frequencies and the ratio of the QTL effect over the other polygenic effects are concerned.

Overestimation of the QTL heritability and proposals for a corrected heritability

The conclusions of CHARCOSSET and GALLAIS (1996) about the overestimation of h²QTL by the R² estimator, in the standard fixed-effect ANOVA framework, cannot be directly extended to our case, where random effects are fitted to our QTLs. In fact, if the model is known, REML estimates of σ²u and of σ²v are unbiased, since the estimates of the fixed effects and the predictions of the random effects are unbiased (the “U” of the BLUE and BLUP acronyms).


However, in QTL mapping, the model is unknown inasmuch as we do not know whether there is a QTL or not or, if we assume that there is one, we do not know which locus must have the effect of its segregation fitted in the model. In this paper, we reported the mean ĥ²QTL from all stochastic realisations, in addition to that of the significant QTLs only. This allowed us to distinguish between the part of the overestimation due to the Beavis effect (BEAVIS 1994) and other remaining sources of bias. Some bias did remain. However, when, in our simulations, h²QTL was estimated at the QTL’s real position only (which, again, one cannot do with real data), this bias disappeared with formula (2). This suggests that:

1- The uncertainty over the QTL’s position generates, by itself, a substantial bias in the QTL heritability estimates. Since the locus retained is the one that yields the model with the maximum likelihood ratio (Lmax), it is also likely to be the locus where a chance association with the residual component of the phenotype is strongest. Thus, in expectation, the residual vector will decrease or increase ĥ²QTL with the same frequency at the QTL’s real position, whereas it will more often increase it at Lmax, since this is, by definition, the locus most strongly associated with the phenotypes.

2- Since formula (2) recovers more of the real information, it was to be expected that ĥ²QTL would, in this case, approach h²QTL more closely, and that formula (1) would give an underestimate at the QTL’s real position.

We shall focus on formula (2), which yields more accurate heritability estimates at the true QTL position, to propose a correction of the (over)estimated heritability at the position of maximum likelihood. At this stage, we considered the significant replicates only, so as to be representative of what would be found with real data. We plotted the LR curve and the curve of the putative QTL heritability estimate along the scanned chromosome for many replicates. The highest LR along the chromosome also corresponds to the highest estimated QTL heritability: both curves showed the same pattern, i.e. the estimated QTL heritability and the LR took proportional values along the y-axis. This property can be demonstrated mathematically. For the lower QTL heritabilities, the position was poorly estimated in many replicates (high standard deviation of the position estimates over the set of significant replicates), and the QTL effects were overestimated. We attributed this to the Beavis effect (BEAVIS 1994), because a chance association between residuals and genotypes can yield a maximum of the QTL heritability away from the true QTL position. In contrast, when drawing the same curves for the replicates with the highest QTL heritability (i.e. 0.2), we observed the same pattern (the two curves still follow each other, as expected), but the detected position was more often very close to the true QTL position (yielding a smaller standard deviation of the position estimates over the set of significant replicates). Thus, the variance component method applied to high QTL heritabilities did not yield such a bias in the QTL heritability estimates. We attributed this to a smaller influence of the residual component than in the case of lower heritabilities. To summarize, we pointed out that:

(i) the estimated QTL heritability curve follows the LR curve;

(ii) formula (2) yields the correct QTL heritability estimate at the true QTL position;

(iii) the poor heritability estimates at the detected position are due to wrong position estimates, and to the averaging of heritabilities that were each obtained at their own estimated position (hence not at the true one).

We worked out a kind of confidence interval for the estimated heritability, based on the confidence interval for the QTL position estimate, bearing in mind that the expected position at which the estimated heritability equals the true one is the true QTL position. Thus a pair of loci that delimits a confidence interval (hence a set of positions) containing the true QTL position also delimits a corresponding set of estimated heritabilities, at the different likely QTL positions within this interval, that contains the true one. The expectation of the heritabilities calculated in the middle of the interval is higher than the true QTL heritability by a factor that depends only on the true QTL heritability and on the experimental design. Towards the ends of the confidence interval, on the other hand, the QTL heritability is on average underestimated, by a factor that depends not only on the QTL heritability and the experimental design but also on the chosen stringency of the confidence interval. Hence, a less stringent confidence interval will contain a higher proportion of underestimates. What we observed was that a 95% confidence interval (as roughly determined by a 2-LOD drop-off interval) seems to contain just the right proportion of under- and overestimates, so that averaging the estimated heritabilities over the confidence interval gives an unbiased estimate.

Leads for improvement:

Taking all the markers to calculate the genetic similarities between the parents seems to be the most appropriate solution for calculating the πi,j at each putative locus. As for the polygenic term of the model, one may argue that the calculation of the matrix A (in v ~ (0, Aσv²)) could be more precise if it were based on the markers that are actually linked to some polygenes, i.e. to some QTLs, instead of using all the markers indiscriminately as we did in this study. Implementing this scheme would require a forward iterative search procedure in which the first step of QTL mapping is carried out with no polygenic term. The second step would add a polygenic term with A equal to the arithmetic mean of the G matrices at the positions of the QTLs detected in step one. New QTLs could then be detected owing to an increase in power, since some background genetic noise would have been removed, or at least the new QTL position estimates are expected to be more precise. In the subsequent round, these new position estimates would be used to update A, and the QTL search could be iterated until a convergence criterion is satisfied. This procedure is somewhat analogous to one option of the Composite Interval Mapping proposed by ZENG (1994), in which a best subset of markers is chosen by stepwise regression and then used as cofactors in the linear model during the QTL search. This procedure, though, can bring an advantage only if a few QTLs explain the genetic variation, as opposed to many QTLs with small effects spread all over the genome.
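The iterative scheme outlined above could be organised as in the following sketch. It is not an implementation of the method: `scan` and `g_at` are hypothetical callables standing for a QTL scan (given a polygenic matrix A, or None for no polygenic term, it returns the detected positions) and for the computation of the G matrix at a position. The convergence criterion used here, an unchanged set of detected positions, is an assumption, since the text leaves the criterion open.

```python
import numpy as np

def iterate_qtl_search(scan, g_at, max_rounds=10):
    """Alternate between QTL detection and updating the polygenic matrix A.

    scan(a_matrix) -> list of detected QTL positions (hypothetical helper)
    g_at(position) -> G matrix at that position (hypothetical helper)
    """
    a_matrix = None                 # first step: no polygenic term
    positions = scan(a_matrix)
    for _ in range(max_rounds):
        if not positions:
            break                   # nothing detected, nothing to average
        # A is the arithmetic mean of the G matrices at the detected QTLs.
        a_matrix = np.mean([g_at(p) for p in positions], axis=0)
        new_positions = scan(a_matrix)
        if sorted(new_positions) == sorted(positions):
            break                   # assumed criterion: same set of QTLs
        positions = new_positions
    return positions, a_matrix
```

A cap on the number of rounds is included because nothing guarantees that the set of detected positions stabilises.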

There is still some scope for a more accurate and probably less biased estimation of the coefficients of coancestry between parents and individuals, in order to better estimate the parameters of the model and increase the QTL detection power. A first attempt to improve the PIBD estimates could be to subtract from all genetic similarities an estimate of the proportion of alleles that supposedly unrelated lines have in common; by definition, these alleles would be alike in state only and not IBD. This method was suggested by MELCHINGER et al. (1991). The use of STRUCTURE (PRITCHARD et al. 2000; FALUSH et al. 2003 for linked loci), for example, would allow parents to be grouped according to a common selection history. Our genetic similarities between the parents would then be replaced by the scalar products of the parents' decompositions over the inferred clusters. The remainder of the PIBD calculation would be identical. Likewise, matrix A could be computed from the same decomposition.
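The cluster-based similarity mentioned above amounts to a matrix of scalar products between membership vectors. The sketch below assumes that each parent is described by a vector of non-negative membership proportions summing to one, as a clustering program such as STRUCTURE would provide; the function name and the example values are hypothetical.

```python
import numpy as np

def cluster_similarity(membership):
    """Pairwise scalar products of cluster-membership vectors.

    membership: array of shape (n_parents, n_clusters); each row is assumed
    to hold non-negative proportions that sum to one.
    """
    q = np.asarray(membership, dtype=float)
    if np.any(q < 0) or not np.allclose(q.sum(axis=1), 1.0):
        raise ValueError("each row must be a vector of proportions summing to 1")
    return q @ q.T

# Three hypothetical parents over two inferred clusters.
q = [[1.0, 0.0],
     [0.5, 0.5],
     [0.0, 1.0]]
print(cluster_similarity(q))
```

With this definition, a parent split evenly between two clusters has a similarity of 0.5 with itself, so the diagonal of the resulting matrix would probably need separate treatment before it is used in the PIBD calculation or as matrix A.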

The proposed method has proved to be powerful in detecting medium-sized QTLs (h²=0.1) in a typical set of inbred lines from a complex pedigree, such as those created in common cereal breeding programs. The use of this kind of method could increase the relevance and cost-effectiveness of quantitative trait locus mapping in applied contexts, and could provide an alternative to the development of specifically designed recombinant populations, by exploiting the genetic variation actually used by plant breeders. JANSEN et al. (2003) proposed to use parental haplotype sharing to routinely map QTLs in breeding populations. The use of haplotypes is challenging in this kind of method, where only little information is available on the founders and on the relationships between parents. It is even more challenging with the increasing use of SNPs rather than microsatellite markers. Combining the proposed IBD method with the method proposed by JANSEN et al. (2003) would be a good solution for mapping QTLs and properly estimating the haplotype effects at a lower genotyping cost. One would first detect QTLs within an IBD-based variance component framework at a low marker density, then use a higher marker density for some QTLs of interest to build haplotypes and estimate their effects within a fixed-effect framework. Such locally high-density mapping could allow the identification of the shortest haplotypes that have the most promising effects. Besides, it would directly provide markers to manipulate these haplotypes in breeding schemes. The methodology developed in this article is currently being applied to the analysis of real wheat breeding data.

Acknowledgments: The authors are grateful to the Ministère de l'Economie, des Finances et de l'Industrie for its financial support (ASG program n° 01 04 90 6058).


LITERATURE CITED

ALMASY, L., and J. BLANGERO, 1998 Multipoint quantitative trait linkage analysis in general pedigrees. Am. J. Hum. Genet. 62: 1198-1211.
BEAVIS, W. D., 1994 The power and deceit of QTL experiments: lessons from comparative QTL studies. American Seed Trade Association. 49th Annual Corn and Sorghum Research Conference. Washington D.C.
BERNARDO, R., 1993 Estimation of coefficient of coancestry using molecular markers in maize. Theor. Appl. Genet. 85: 1055-1062.
BINK, M. C. A. M., P. UIMARI, M. SILLANPÄÄ, L. JANSS and R. JANSEN, 2002 Multiple QTL mapping in related plant populations via a pedigree-analysis approach. Theor. Appl. Genet. 104: 751-762.
CHARCOSSET, A., and A. GALLAIS, 1996 Estimation of the contribution of quantitative trait loci (QTL) to the variance of a quantitative trait by means of genetic markers. Theor. Appl. Genet. 93: 1193-1201.
COCKERHAM, C. C., 1983 Covariances of relatives from self-fertilization. Crop Sci. 23: 1177-1180.
DONINI, P., J. R. LAW, R. M. D. KOEBNER, J. C. REEVES and R. J. COOKE, 2000 Temporal trends in the diversity of UK wheat. Theor. Appl. Genet. 100: 912-917.
FALUSH, D., M. STEPHENS and J. K. PRITCHARD, 2003 Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 164: 1567-1587.
GEORGE, A. W., P. M. VISSCHER and C. S. HALEY, 2000 Mapping quantitative trait loci in complex pedigrees: a two-step variance component approach. Genetics 156: 2081-2092.
GILMOUR, A. R., B. R. CULLIS, S. J. WELHAM and R. THOMPSON, 1998 ASREML. Program User Manual. Ed. Orange Agricultural Institute, New South Wales, Australia.
GIMELFARB, A., and R. LANDE, 1994 Simulation of marker-assisted selection in hybrid populations. Genet. Res. 63: 39-47.
GIMELFARB, A., and R. LANDE, 1995 Marker-assisted selection and marker-QTL associations in hybrid populations. Theor. Appl. Genet. 91: 522-528.
HALEY, C. S., and S. KNOTT, 1992 A simple regression method for mapping quantitative trait loci in line crosses using flanking markers. Heredity 69: 315-324.
HARRIS, D. L., 1964 Genotypic covariances between inbred relatives. Genetics 50: 1319-1348.
HOSPITAL, F., L. MOREAU, F. LACOUDRE, A. CHARCOSSET and A. GALLAIS, 1997 More on the efficiency of marker-assisted selection. Theor. Appl. Genet. 95: 1181-1189.
JANSEN, R. C., 1993 Interval mapping of multiple quantitative trait loci. Genetics 135: 205-211.
JANSEN, R. C., J-L. JANNINK and W. D. BEAVIS, 2003 Mapping quantitative trait loci in plant breeding populations: use of parental haplotype sharing. Crop Sci. 43: 829-834.
JIANG, C., and Z-B. ZENG, 1995 Multiple trait analysis of genetic mapping for quantitative trait loci. Genetics 140: 1111-1127.
KEMPTHORNE, O., 1955 The correlation between relatives in inbred populations. Genetics 40: 681-691.
KOROL, A., Y. RONIN and V. KIRZHNER, 1995 Interval mapping of quantitative trait loci employing correlated trait complexes. Genetics 140: 1137-1147.
LANDE, R., and R. THOMPSON, 1990 Efficiency of marker-assisted selection in the improvement of quantitative traits. Genetics 124: 743-756.
LANDER, E., and D. BOTSTEIN, 1989 Mapping mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121: 185-199.
LYNCH, M., and B. WALSH, 1998 Genetics and Analysis of Quantitative Traits. Sinauer Associates, Sunderland, MA.
MALECOT, G., 1948 Les mathématiques de l'hérédité. Ed. Masson et Cie, Paris.
MELCHINGER, A. E., M. M. MESSMER, M. LEE, W. L. WOODMAN and K. R. LAMKEY, 1991 Diversity and relationships among U.S. maize inbreds revealed by restriction fragment length polymorphisms. Crop Sci. 31: 669-678.
MOREAU, L., S. LEMARIÉ, A. CHARCOSSET and A. GALLAIS, 2000 Economic efficiency of one cycle of marker-assisted selection. Crop Sci. 40: 329-337.
MURANTY, H., 1996 Power of tests for quantitative trait loci detection using full-sib families in different schemes. Heredity 76: 156-165.
NEI, M., and W. H. LI, 1979 Mathematical model for studying genetic variations in terms of restriction endonucleases. Proc. Natl. Acad. Sci. 76: 5369-5373.
PRITCHARD, J. K., M. STEPHENS and P. DONNELLY, 2000 Inference of population structure using multilocus genotype data. Genetics 155: 945-959.
RODER, M. S., K. WENDEHAKE, V. KORZUN, G. BREDEMEIJER, D. LABORIE et al., 2002 Construction and analysis of a microsatellite-based database of European wheat varieties. Theor. Appl. Genet. 106: 67-73.
SERVIN, B., C. DILLMANN, G. DECOUX and F. HOSPITAL, 2002 MDM a program to compute fully informative genotype frequencies in complex breeding schemes. J. Hered. 93(3): 227-228.
S-PLUS, 2000 S-PLUS guide to statistical and mathematical analyses. MathSoft, Massachusetts Institute of Technology
XIE, C., D. D. G. GESSLER and S. XU, 1998 Combining different line crosses for mapping quantitative trait loci using the identical by descent-based variance component method. Genetics 149: 1139-1146.
XU, S., 1998 Mapping quantitative trait loci using multiple families of line crosses. Genetics 148: 517-524.
XU, S., and W. R. ATCHLEY, 1995 A random model approach to interval mapping of quantitative trait loci. Genetics 141: 1189-1197.
YI, N., and S. XU, 2001 Bayesian mapping of quantitative trait loci under complicated mating designs. Genetics 157: 1759-1771.
ZENG, Z-B., 1993 Theoretical basis of separation of multiple linked gene effects on mapping quantitative trait loci. Proc. Natl. Acad. Sci. USA 90: 10972-10976.
ZENG, Z-B., 1994 Precision mapping of quantitative trait loci. Genetics 136: 1457-1468.


TABLE 1

Estimates of the position, QTL and total genetic heritabilities (h²QTL and h²g respectively)

Experimental design        h²g             Position (cM)    ĥ²QTL           ĥ²g

(1) QTL heritability and formula
h²QTL=0.05, formula (1)    0.427 (0.068)   49.44 (25.42)    0.067 (0.035)   0.427 (0.108)
h²QTL=0.05, formula (2)    0.427 (0.068)   46.45 (21.84)    0.085 (0.041)   0.432 (0.107)
h²QTL=0.1, formula (1)     0.435 (0.051)   47.76 (20.88)    0.099 (0.042)   0.442 (0.081)
h²QTL=0.1, formula (2)     0.435 (0.051)   45.52 (17.72)    0.129 (0.056)   0.450 (0.083)
h²QTL=0.2, formula (1)     0.456 (0.047)   44.91 (7.35)     0.192 (0.058)   0.451 (0.083)
h²QTL=0.2, formula (2)     0.456 (0.047)   45.73 (6.89)     0.228 (0.065)   0.467 (0.091)

(2) Different experimental designs
Standard setting           0.435 (0.051)   45.52 (17.72)    0.129 (0.056)   0.450 (0.083)
AIS 25%                    0.435 (0.051)   47.15 (17.99)    0.135 (0.060)   0.445 (0.084)
NA 10%                     0.435 (0.051)   48.55 (18.87)    0.131 (0.098)   0.434 (0.094)
NP=40                      0.451 (0.067)   48.42 (13.75)    0.136 (0.053)   0.446 (0.108)
NG=20, formula (1)         0.413 (0.059)   48.00 (18.75)    0.091 (0.039)   0.430 (0.091)
NG=20, formula (2)         0.413 (0.059)   46.89 (15.96)    0.145 (0.049)   0.436 (0.094)
d=20 cM                    0.433 (0.052)   52.27 (23.10)    0.149 (0.070)   0.395 (0.088)
d=20, AIS 25%              0.433 (0.052)   52.87 (24.80)    0.147 (0.076)   0.399 (0.087)
Npar=50, NF6=250           0.403 (0.077)   49.86 (23.25)    0.159 (0.070)   0.441 (0.149)

The standard setting is a QTL heritability of 0.1, a total genetic heritability (h²g) of 0.435 (with standard deviation 0.051 among the 100 repetitions) and a 10 cM marker density, with a QTL at 45 cM. About 500 individuals are created at the F6 stage from 100 parents in each breeding cycle. h²g results from the readjustment of the residuals in order to obtain a given target h²QTL. The studied generation comes from the 10th cycle of breeding, starting from 20 founder lines at the beginning of selection.

(1) Each set of runs differs by the simulated QTL heritability noted in column one and by the IBD formula tested. IBD formula (1) takes into account only half-sib and full-sib relationships, while formula (2) adds ancestor relationships estimated by markers.

(2) Each additional set of runs differs from the standard setting by the simulation parameter noted in column one. Except for the NG=20 case, all the other settings are tested with formula (2) only. Means and standard deviations (in parentheses) are calculated among the 100 replicates.


TABLE 2

LR threshold, test statistic and QTL detection power

Experimental design        Threshold   Test statistic   Power (%)

(1) QTL heritability and formula
h²QTL=0.05, formula (1)    4.08        4.62 (3.52)      47
h²QTL=0.05, formula (2)    3.96        5.23 (3.97)      58
h²QTL=0.1, formula (1)     4.08        7.75 (5.06)      71
h²QTL=0.1, formula (2)     3.96        9.76 (6.71)      80
h²QTL=0.2, formula (1)     4.08        22.03 (10.9)     100
h²QTL=0.2, formula (2)     3.96        23.69 (10.8)     100

(2) Different experimental designs
Standard setting           3.96        9.76 (6.71)      80
AIS 25%                    4.28        9.91 (6.56)      72
NA 10%                     4.30        7.81 (5.53)      66
NP=40                      3.70        11.29 (7.09)     89
NG=20, formula (1)         3.72        5.95 (4.15)      65
NG=20, formula (2)         3.62        10.62 (5.75)     91
d=20                       4.25        9.21 (6.81)      75
d=20, AIS 25%              4.94        8.06 (6.50)      64
Npar=50, NF6=250           4.00        7.96 (3.66)      62

See Table 1 for the standard setting. Each additional set of runs differs from the standard setting by the parameter change noted in column one. Threshold represents the empirical threshold calculated for 1000 replicates. Test statistic is the mean and standard deviation of the maximum of the LR test for the 100 replicates. Power is the percentage of replicates with max LR exceeding the empirical threshold.
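The threshold and power columns of Table 2 can be computed along the following lines. This is a sketch only: the type I error rate `alpha=0.05` is an assumption (the caption does not state it), and the input arrays stand for the maximum LR of each replicate simulated without a QTL (for the threshold) and with a QTL (for the power).

```python
import numpy as np

def empirical_threshold(null_max_lr, alpha=0.05):
    """Upper (1 - alpha) quantile of the maximum LR over null replicates."""
    return float(np.quantile(np.asarray(null_max_lr, dtype=float), 1.0 - alpha))

def detection_power(max_lr, threshold):
    """Percentage of replicates whose maximum LR exceeds the threshold."""
    max_lr = np.asarray(max_lr, dtype=float)
    return 100.0 * float(np.mean(max_lr > threshold))

# Hypothetical values, for illustration only.
threshold = empirical_threshold(np.arange(1, 101))
power = detection_power([3.0, 5.0, 10.0, 2.0], threshold=4.0)
print(threshold, power)
```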


TABLE 3

Accuracy of three methods used to infer Confidence Intervals (CI)

Experimental design        CI- 4 S.D.   % 4 S.D.   CI- 1 LDO   % 1 LDO   CI- 2 LDO   % 2 LDO

(1) QTL heritabilities and IBD formula
h²QTL=0.05, formula (1)    86.40        91         41.78       79        86.66       97
h²QTL=0.05, formula (2)    72.20        93         41.18       79        85.43       97
h²QTL=0.1, formula (1)     63.28        90         29.10       82        74.75       99
h²QTL=0.1, formula (2)     63.40        93         31.04       86        66.62       96
h²QTL=0.2, formula (1)     29.40        96         17.26       90        34.33       97
h²QTL=0.2, formula (2)     26.00        98         15.21       91        28.21       97

(2) Different experimental designs
Standard setting           63.40        93         31.04       86        66.62       96
AIS 25%                    63.84        92         26.4        80        62.9        97
NA 10%                     63.24        91         28.48       80        67.6        95
NP=40                      52.24        94         26.66       85        60.34       97
NG=20, formula (1)         67.60        91         43.95       72        82.80       96
NG=20, formula (2)         62.20        91         27.30       76        66.44       97
D=20                       93.60        96         36.35       75        73.96       94
D=20, AIS 25%              94.56        95         35.08       67        75.61       91
Npar=50, NF6=250           88.16        89         45.61       79        82.82       95

See Table 1 for the standard setting. “CI- 4 S.D.” represents the size of the CI calculated as four times the standard deviation of the estimated QTL position for the significant runs, and “% 4 S.D.” represents the percentage of times that the true position is included within the delimited interval. “CI- 1 LDO” and “CI- 2 LDO” represent the size of the CI as determined by a drop of one and two LOD units respectively (multiply by 2*ln(10) to convert into LR units). “% 1 LDO” and “% 2 LDO” represent the percentage of times that the true position is included within the delimited interval.


TABLE 4

Estimates of the QTL parameters for different levels of QTL heritability under the empirical threshold

Experimental design        h²g             Position (cM)    ĥ²QTL           ĥ²g

h²QTL=0.05, formula (1)    0.427 (0.068)   47.55 (21.61)    0.092 (0.032)   0.429 (0.110)
h²QTL=0.05, formula (2)    0.427 (0.068)   45.98 (18.05)    0.110 (0.030)   0.437 (0.109)
h²QTL=0.1, formula (1)     0.435 (0.051)   49.56 (15.82)    0.117 (0.035)   0.452 (0.081)
h²QTL=0.1, formula (2)     0.435 (0.051)   45.56 (15.85)    0.143 (0.051)   0.450 (0.08)
h²QTL=0.2, formula (1)     0.456 (0.047)   44.91 (7.35)     0.192 (0.058)   0.451 (0.083)
h²QTL=0.2, formula (2)     0.456 (0.047)   45.49 (6.50)     0.230 (0.062)   0.471 (0.081)

Each set of runs differs by the simulated QTL heritability noted in column one, and by the IBD formula tested. Means and standard deviations (in parentheses) are reported only for the significant replicates (noted in the “Power” column of Table 2) among 100.


TABLE 5

Estimates of the QTL and total genetic heritabilities at the true QTL position (45 cM)

           h²QTL=0.05, h²g=0.427            h²QTL=0.1, h²g=0.435             h²QTL=0.2, h²g=0.456
Formula    ĥ²QTL           ĥ²g              ĥ²QTL           ĥ²g              ĥ²QTL           ĥ²g

(1)        0.036 (0.037)   0.426 (0.106)    0.068 (0.048)   0.442 (0.080)    0.191 (0.056)   0.455 (0.081)
(2)        0.051 (0.047)   0.430 (0.108)    0.094 (0.064)   0.447 (0.080)    0.225 (0.066)   0.471 (0.094)

Formulas (1) and (2) are tested for three levels of QTL heritability at the true QTL position. h²g results from the readjustment of the residuals in order to obtain a given target h²QTL.


TABLE 6

Averaged Confidence Interval Heritability for the significant replicates over the 2-LOD drop-off interval heritabilities

                h²QTL=0.05, h²g=0.427            h²QTL=0.1, h²g=0.435             h²QTL=0.2, h²g=0.456
No-Selection    ĥ²QTL           ĥ²g              ĥ²QTL           ĥ²g              ĥ²QTL           ĥ²g

(1)             0.049 (0.028)   0.405 (0.104)    0.072 (0.039)   0.434 (0.083)    0.158 (0.062)   0.438 (0.085)
(2)             0.054 (0.034)   0.418 (0.109)    0.100 (0.063)   0.438 (0.081)    0.194 (0.067)   0.459 (0.082)

Formulas (1) and (2) are tested for three levels of QTL heritability. h²g results from the readjustment of the residuals in order to obtain a given target h²QTL.
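The averaging reported in Table 6 can be sketched as follows for a single replicate. The conversion of one LOD unit into 2*ln(10) LR units follows the caption of Table 3. Taking the interval as the contiguous run of scanned positions around the LR peak is an assumption of this sketch, as are the function names and the example profiles.

```python
import math
import numpy as np

LR_PER_LOD = 2.0 * math.log(10.0)  # one LOD unit expressed in LR units

def lod_dropoff_interval(lr_profile, drop_lod=2.0):
    """Indices (left, right) of the contiguous run of positions around the
    LR peak whose LR stays within drop_lod LOD units of the maximum."""
    lr = np.asarray(lr_profile, dtype=float)
    peak = int(np.argmax(lr))
    cutoff = lr[peak] - drop_lod * LR_PER_LOD
    left = peak
    while left > 0 and lr[left - 1] >= cutoff:
        left -= 1
    right = peak
    while right < len(lr) - 1 and lr[right + 1] >= cutoff:
        right += 1
    return left, right

def interval_averaged_h2(lr_profile, h2_profile, drop_lod=2.0):
    """Mean of the heritability estimates over the drop-off interval."""
    left, right = lod_dropoff_interval(lr_profile, drop_lod)
    h2 = np.asarray(h2_profile, dtype=float)
    return float(np.mean(h2[left:right + 1]))

# Hypothetical LR and heritability profiles along one chromosome.
lr = [1.0, 4.0, 12.0, 15.0, 13.0, 5.0, 2.0]
h2 = [0.01, 0.04, 0.12, 0.15, 0.13, 0.05, 0.02]
print(lod_dropoff_interval(lr), interval_averaged_h2(lr, h2))
```

In this toy profile the averaged value is lower than the estimate at the peak, which is the direction of the correction discussed in the text; whether the average is unbiased in a given design would still have to be checked by simulation.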


[FIGURE 1: diagram of the breeding scheme, from the founder lines (full marker-QTL linkage disequilibrium) through the G0 generation and 10 breeding cycles (matrix of crosses, 500 F6 lines, 1 to 4 progenies per cross plus half-sib relationships) to the G10 generation used for QTL detection.]

[FIGURE 2: plot of the number of progenies per parent (y-axis, 0 to 80; mean and standard deviation) against parent name (x-axis).]

[FIGURE 3]

FIGURE LEGENDS:

FIGURE 1: Simulation of the breeding scheme

FIGURE 2: Mean and standard error for 1000 replicates created by the “matrix of crosses” function for the half-sib family size under the standard setting (100 parents, 500 F6).

FIGURE 3: Comparison of the LR profiles for (a) different levels of QTL heritabilities for formula (1) and (2), (b) different biased marker information for the standard setting under formula (2), (c) difference in the breeding schemes compared to the standard setting, and (d) a density of one marker every 20 cM, with 25% of AIS non-IBD alleles.


Full research paper

Marker-assisted introgression of 4 Phytophthora capsici resistance QTL alleles into a bell pepper line: validation of additive and epistatic effects

A. Thabuis¹, A. Palloix¹, B. Servin², A.M. Daubèze¹, P. Signoret¹, F. Hospital² and V. Lefebvre¹

¹ INRA, Genetics and Breeding of Fruits and Vegetables, BP94, 84143 Montfavet cedex, France.

² INRA, UMR de Génétique Végétale, Ferme du Moulon, 91190 Gif sur Yvette, France.

Submitted with 5 tables and 4 figures.

Corresponding author: Véronique Lefebvre

INRA, Genetics and Breeding of Fruits and Vegetables, BP94, 84143 Montfavet cedex, France. E-mail: [email protected] Phone number: +33 (0)4 32 72 28 06 Fax number: +33 (0)4 32 72 27 02


Abstract

The aim of the present study was to use markers to transfer alleles for resistance to P. capsici at four quantitative trait loci (QTLs) from a small-fruited pepper into a bell pepper recipient line. The marker-assisted selection program was initiated from a doubled-haploid line derived from the mapping population and involved three cycles of marker-assisted backcross (MAB). Two populations derived by selfing plants selected after the first selection cycle were genotyped and evaluated phenotypically for their resistance level. The additive and epistatic effects of the four resistance factors were detected again and validated in these populations, indicating that the introgression of the four QTLs in this MAB program was successful. A decrease in the effect of the moderate-effect QTLs and of the epistatic interaction was observed. Phenotypic evaluations of horticultural traits were performed on a sample of each backcross generation. The results indicated an efficient return to the recipient phenotype using this MAB strategy.

Key words: Capsicum annuum L., Disease resistance, Epistasis, Horticultural traits, Marker-assisted selection, QTL

Introduction

Phytophthora capsici, which causes root rot and shoot blight, is one of the most devastating field and greenhouse diseases of pepper crops worldwide. This soilborne Oomycete can attack the plant at any developmental stage, causing sudden wilt and the collapse of the plant. Soil chemical treatments have technical limitations and are expected to be progressively banned under environmental legislation. Breeding for P. capsici resistance therefore remains a relevant challenge. Several sources of resistance have been described in the intraspecific pepper germplasm, but all displayed a partial effect and were found in exotic accessions. Among them, Perennial is an Indian line displaying polygenic resistance (Lefebvre and Palloix, 1996), but it is small-fruited and pungent. The polygenic resistance to P. capsici was dissected into 4 resistance components using 2 phenotypic tests performed under controlled conditions. Three resistance components (REC: receptivity, IND: inducibility and STA: stability) were quantitatively evaluated using the stem inoculation procedure and reflected different steps of the adult plant-pathogen interaction (Pochard et al., 1976; Pochard and Daubèze, 1980). The Root Rot Index component (RRI) was a semi-quantitative criterion based on the evaluation of resistance after root inoculation of young plantlets (Palloix et al., 1988). Many pepper breeding programs have focused on introducing resistance to P. capsici into large-fruited cultivars. However, they were not fully successful, since the released varieties displayed only a weak resistance level. In order to enhance the overall resistance level, Palloix et al. (1990) initiated a first phenotypic recurrent breeding scheme to accumulate resistance factors from distinct accessions. A second phenotypic recurrent breeding scheme was set up to transfer the polygenic resistance into a favourable genetic background (Palloix et al., 1997). Three to six selection cycles made it possible to transfer an intermediate resistance level into a large-fruited pepper. However, attempts to transfer a higher resistance level decreased the genetic advance for horticultural traits. The advent of molecular markers made it possible to identify the chromosomal regions involved in the variation of the components of the P. capsici resistance of Perennial and to estimate their individual effects.


This analysis was conducted using 114 doubled haploid (DH) lines derived from the cross Perennial x Yolo Wonder (YW). A total of 5 different genomic regions displayed an additive effect on resistance. Four major epistatic relationships were detected, either between additive QTLs or between QTLs involved only in epistatic relationships (Lefebvre and Palloix, 1996; Thabuis et al., 2003). Marker-assisted selection appeared as a promising tool for breeding quantitative resistance. Given the genetic distance between Perennial and bell pepper accessions (Lefebvre et al., 2001), the marker-assisted backcross (MAB) strategy appeared to be the most suitable for transferring a limited number of QTLs. However, given the imprecision of the QTL positions, Hospital and Charcosset (1997) showed that MAB needs to be optimised for a successful QTL transfer. They proposed a two-step strategy: (i) selection for the donor alleles on the carrier chromosomes (foreground selection) and (ii) among the remaining plants, selection for the return to the recipient parent (background selection). Through their theoretical study, they showed that three markers spread along the confidence interval of each QTL enabled an efficient control of the QTLs during the introgression. Once the interval lengths and marker locations were defined, they computed the minimal population size needed to recover at least one plant carrying the entire donor segments for a given type I error. We conducted a MAB program to transfer favourable alleles at the 4 main QTLs controlling resistance to P. capsici from the Perennial accession into YW, a bell pepper line, taking into account the optimisations of Hospital and Charcosset (1997). To speed up the breeding process, a DH line from the mapping population carrying all the chromosomal regions to be transferred (Thabuis et al., 2003) was used as the donor parent to initiate the MAB program. In this paper, we examine (i) the results of 3 MAB cycles conducted according to the theoretical optimisations, (ii) the additive and epistatic effects of the transferred segments in validation populations, and (iii) the impact of the background selection step on the improvement of horticultural traits.
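The kind of population-size calculation mentioned in the Introduction can be illustrated with a generic binomial argument. The sketch below is not the computation of Hospital and Charcosset (1997): it assumes a backcross in which the selected parent is heterozygous over each target segment, no crossover interference (Haldane mapping function), independent segments on different chromosomes, and a segment counted as transferred when the donor allele is present at all of its markers. The 10 cM marker spacings and the 1% risk are hypothetical.

```python
import math

def haldane_recombination(distance_cm):
    """Recombination fraction for a map distance in cM (Haldane function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))

def prob_segment_retained(marker_gaps_cm):
    """Probability that a backcross plant carries the donor allele at every
    marker of one segment: 1/2 for the first marker, then no recombination
    in each gap between adjacent markers (assumes no interference)."""
    p = 0.5
    for gap in marker_gaps_cm:
        p *= 1.0 - haldane_recombination(gap)
    return p

def min_population_size(p_success, risk=0.01):
    """Smallest n such that the probability of obtaining no suitable plant,
    (1 - p_success) ** n, does not exceed the accepted risk."""
    if not 0.0 < p_success < 1.0:
        raise ValueError("p_success must lie strictly between 0 and 1")
    return math.ceil(math.log(risk) / math.log(1.0 - p_success))

# Three independent segments, each controlled by three markers 10 cM apart.
p_all = prob_segment_retained([10.0, 10.0]) ** 3
print(p_all, min_population_size(p_all, risk=0.01))
```

The required population size grows quickly with the number of segments, since the per-plant probability is a product over segments; this is one reason why treating two linked QTLs as a single segment can be advantageous.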

Materials and methods Plant material and breeding scheme Backcross populations: DH285, a DH line issued from the initial mapping population, was chosen as the donor parent (Fig. 1): it possessed the 4 favourable alleles to be transferred and a YW genome content of 3.6% on the QTL-carrier chromosomes and 43.8% on the non-carrier chromosomes (Lefebvre et al., 2002). Two of the favourable alleles to be transferred were linked (approximately 20 cM) on the chromosome P5 so that they were considered as a single segment during the MAB process whereas the 2 others mapped on P2 and P10. To increase the success to transfer favourable alleles at the QTLs, the 3 QTL-carrier chromosomes of the DH285 line are almost entirely of the Perennial phase (Fig. 2). As indicated on Fig. 1, 3 MAB cycles were performed. The BC1, BC2 and BC3 populations were firstly screened with markers linked to and in coupling with the 4 P. capsici resistance alleles at the QTL, and finally for the recovery of the recipient genetic background (as described below in molecular analyses). QTL validation populations: Two validation populations were derived from the backcross populations in order to evaluate accurately the QTL effects in the recipient genetic background (Fig. 1). A large progeny, named BC1S1_AE, composed of 620 plants, was derived by selfing the plant selected after the first MAB cycle, containing the 4 resistance alleles at the QTLs. This population was genotyped with the

3

QTL markers and was evaluated phenotypically for P. capsici resistance level, in order to evaluate QTL additive effects. A second validation population was aimed at evaluating the epistatic relationship between the 2 additive QTLs on P5 and P10. Two BC1 plants (BC1_251 and BC1_301) carrying both segments on P5 and P10 involved in interaction and the most efficient return to the recipient parent were used. A progeny of 620 plants, named BC1S1_EE, was derived by selfing these two BC1 plants. It was genotyped with markers of the P5 and P10 QTLs, and assessed for P. capsici resistance level. In addition, the resistance level of the selfed progeny of the selected BC2 plants (BC2S1) was assessed to evaluate the performance of the MAB strategy. Resistance evaluation The moderately aggressive P. capsici strain S101 was used for phenotypic assays (Lefebvre and Palloix, 1996). It was maintained as described by Clerjeau et al. (1976). P. capsici resistance was evaluated using the stem inoculation test. The plants were decapitated at the 6th/7th-leaf stage and a mycelium plug was placed on the fresh section of the stem. The plants were kept in growth chambers under the conditions described in Pochard and Daubèze (1980). The measure of length necrosis during 21 days supplied 3 resistance components, as described by Lefebvre and Palloix (1996). Receptivity (REC, mm.d-1) measured the pathogen spread in early infection process (3rd day post inoculation, DPI). Inducibility (IND, mm.d-2) measured the deceleration of the necrosis length between the 3rd and the 10th DPI. Stability (STA, mm.d-1) measured the average speed of necrosis length between the 14th and the 21st DPI. The lower the value of the resistance component, the higher the resistance level. In all tests, controls were YW, Perennial, DH285 and PI201234 (10 plants per genotype). Horticultural trait evaluation The horticultural traits of samples from each backcross generation (BC) were evaluated. 
The BC1 and BC2 populations were evaluated in 2000 and 2001. The BC3 population was evaluated only in 2001. Each year, trials were conducted under 2 cultivation conditions: under cold plastic greenhouse conditions from September to January at Almeria (Spain), and under field conditions from July to November in Sicily (Italy). For each BC population, 50 sampled plants were evaluated in a completely randomised design of 5 blocks of 10 plants. For the control inbred lines, 5 blocks of 5 plants each were included. The controls were YW, Perennial, and DH285. The horticultural traits evaluated were the length of the main axis (AL in cm, from the cotyledons to the 1st flower) and the number of leaves (NL) on this axis, from which the internode length was calculated (IL = AL/NL). For the fruit traits, 5 to 10 mature fruits were harvested from each plant and weighed together. The average fruit weight per plant (AFW in g) was computed. The most representative fruit from each plant was chosen to measure the fruit length (FL in mm), the fruit width (FW in mm) and the fruit flesh thickness (FT in mm), and to calculate the fruit shape (FS = FL/FW). All these horticultural traits clearly discriminated Perennial from YW.

Molecular data

The DNA was extracted using the microprep protocol described by Fulton et al. (1995). For the foreground selection step, 3 to 4 markers (3 for the QTLs on P5 and P10, 4 for the QTL on P2) were used to control each segment. One marker was located close to the most likely position of the QTL, while the others controlled the QTL position support interval defined by a LOD drop-off of 1.5 (Fig. 3). A total

of 10 markers were used for the selection of the 4 resistance QTLs. Markers included 3 RAPDs (further converted into 1 SCAR and 2 CAPS), 5 AFLPs, 1 ISSR and 1 RFLP. The dominant markers used were in coupling phase with the resistance alleles at the QTLs. The RAPD markers R08_1.9p, C01_1.25p and F11_0.65p, revealed as described by Lefebvre et al. (1995), were used for the first MAB cycle. For the two other MAB cycles, R08_1.9p and C01_1.25p were converted into codominant CAPS markers (ASC037 and ASC031, respectively) whereas F11_0.65p was converted into a dominant SCAR marker (ASC035p). Five AFLP markers, revealed as described by Vos et al. (1995) and named E35M61-115p, E37M59-351p, E40M48-108p, E40M55-105p and E42M55-Fp, were used to assess the presence of the favourable alleles at the QTLs. The RFLP marker TG586 was revealed as described by Lefebvre et al. (1993) using EcoRI as the restriction enzyme. The other markers provided by the 5 previous AFLP primer combinations and by 11 additional ones (E37M54, E37M60, E37M62, E38M48, E38M61, E39M48, E39M49, E41M49, E41M51, E43M53 and E45M58) were used for the background selection. For the BC1S1_AE population, the 3 specific PCR markers as well as 8 AFLP markers generated by 3 primer combinations (E37M59, E43M53 and E35M61) were assayed. These 8 markers mapped to the confidence intervals of the transferred QTLs and are shown in Fig. 3. For the BC1S1_EE population, 2 specific PCR markers that mapped to the two interacting regions (ASC037 on P5 and ASC035p on P10) and the AFLP marker E40M55-104y, which mapped to P10, were assayed.

Statistical analyses

Plant selection

The molecular data from the MAB program were used to select plants carrying the donor alleles at the 3 introgressed intervals and displaying the best return to the recipient parent. To estimate the return to the recipient parent more accurately, precision graphical genotypes were computed using the ‘MDM’ software developed by Servin et al.
(2002), which computes the probability of donor allele presence at each point of the genome. The ‘GRAFGEN’ software was used to draw the ‘precision graphical genotypes’ from the MDM results (Servin and Hospital, Submitted, http://moulon.inra.fr/~servin/grafgen).

Validation of the additive and epistatic QTL effects

The additive QTL validation was performed in BC1S1_AE by one-way ANOVA, in which the phenotypic resistance value is explained by a single-marker effect, with PROC GLM of the SAS package (SAS Institute, 1989). The 11 markers assayed for the validation step were mapped with a LOD threshold of 3 and a maximum recombination rate of 0.3 using the ‘Mapmaker’ software (Lander et al., 1987). The QTL validation was also performed by Interval Mapping (IM) with the ‘QTL Cartographer’ software (Basten et al., 1997), which provided the most likely position of the QTLs, their effect (R²) and the additive and dominance effects (a, d). The dominance ratio |d/a| was also estimated for each QTL. The phenotypic means for the 3 resistance components were computed for each genotypic class using PROC GLM with the ‘lsmeans’ option, and compared using the ‘tdiff’ option with a type I error rate of 5%. The digenic interaction effect between ASC037 on P5 and 2 markers on P10 (ASC035p and E40M55-104y) was tested using a 2-way ANOVA in the BC1S1_EE population with two additive marker effects and an interaction factor between both markers (PROC GLM). As the 2 markers used to check P10 were dominant and in distinct allelic phases, the genotypic class could be deduced

at the P10 QTL (in the absence of recombination events). The phenotypic means for the 3 resistance components were computed for each genotypic class (‘lsmeans’ option) in order to compare the 9 allelic combinations (‘tdiff’ option with a type I error rate of 5%).

Phenotypic analysis of the BC populations

Resistance evaluation of the BC1S1_AE and BC2S1 populations was performed in 2 different years. Individual plant values were therefore adjusted to the control values (PROC REG) after checking the homogeneity of variances. A one-way ANOVA was performed in which the resistance component value of a plant is explained by the single MAB cycle effect (PROC GLM). The adjusted means per cycle were computed and compared for the 3 resistance components using the ‘lsmeans’ and ‘tdiff’ options with a type I error rate of 5%. The main sources of variation affecting horticultural traits were analysed on the data collected from the 4 trials (2 years, 2 locations) with the following ANOVA model: $P_{ijkl} = \mu + B_i + L_j + Y_k + (B \times L)_{ij} + (B \times Y)_{ik} + R_{ijkl}$ (PROC GLM), where $P_{ijkl}$ is the horticultural trait value of plant $l$, $B_i$ the effect of backcross population $i$, $L_j$ the effect of location $j$, $Y_k$ the effect of year $k$, $(B \times L)_{ij}$ the interaction between backcross population $i$ and location $j$, $(B \times Y)_{ik}$ the interaction between backcross population $i$ and year $k$, and $R_{ijkl}$ the residual effect. Trials were also analysed separately, for each location and each year, with the following ANOVA model: $P_{ijk} = \mu + B_i + b_j + (B \times b)_{ij} + R_{ijk}$ (PROC GLM), where $b_j$ is the effect of block $j$ and $(B \times b)_{ij}$ is the interaction between backcross population $i$ and block $j$. An effect was declared significant if P