The confusions of genetic relatedness

By Razib Khan | September 12, 2010 12:12 pm

Last spring I posted ‘Beyond visualization of data in genetics’ in the hope that people wouldn’t take PCA too far by assuming that the method reflects reality in any definite fashion. Remember, PCA visualizations show you two, or at most three, dimensions of the genetic variation within the data set at any given time. The fine print is important; e.g., “PC 1 15%”, “PC 2 4.5%”, etc., which gives the share of the variation in the data that each dimension accounts for. You see the largest, and likely historically most significant on a population-wide scale, components of genetic variance, but there’s still a large remainder left over. But when I look at referrals from message boards, people obviously aren’t careful about what PCA is telling them.
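To make the fine print concrete: with the labels quoted above, PC 1 and PC 2 together cover 15% + 4.5% = 19.5% of the variance, so roughly 80% of the variation never appears on a two-dimensional plot. The snippet below is a minimal sketch of where such percentages come from, assuming a genotype matrix with individuals as rows and markers coded 0/1/2. The toy data and the function name `explained_variance_fractions` are hypothetical and are not taken from any 23andMe tool.

```python
import numpy as np

def explained_variance_fractions(genotypes):
    """Return the fraction of total variance captured by each principal component."""
    # Center each marker (column) so the components describe variation around the mean.
    centered = genotypes - genotypes.mean(axis=0)
    # Squared singular values of the centered matrix are proportional to per-PC variance.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

# Hypothetical toy data: 40 individuals x 200 markers, genotypes coded 0/1/2.
rng = np.random.default_rng(0)
toy_genotypes = rng.integers(0, 3, size=(40, 200)).astype(float)

fractions = explained_variance_fractions(toy_genotypes)
shown = fractions[:2].sum()
print(f"PC1 {fractions[0]:.1%}, PC2 {fractions[1]:.1%}, not on the plot {1 - shown:.1%}")
```

Random toy data has no population structure, so the leading components here will be small; real data sets concentrate more variance in the first few dimensions, but the remainder is still what the plot leaves out.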

As an illustration, in the 23andMe user interface you can “compare genes” across the people with whom you “share genes.” This comparison operates over ~550,000 single nucleotide polymorphisms out of 3 billion base pairs (you can constrain it to traits, but I’m going to talk about the comparison to the whole data set below). For example, a man of European descent shares 83.2% with his daughter, who is Eurasian (the mother is Burmese, with some recent Indian admixture). Another man of European descent shares 84% with his daughter, whose mother is also European (in fact, both parents are western European). The “gene sharing” of these two men with other people of European descent is in the 74-75% range (for reference, a Chinese person is 71%, and a Nigerian 68.5%). On the PCA plot the European and his Eurasian daughter are very far apart, while the European man and his European daughter cluster together. What you’re seeing on the PCA chart is population-level information, not the genetic uniqueness within families and across parents and offspring.

To further explore this issue, I thought it would be interesting to revisit my own genetic data. If you read my previous post, you will know it is not boring. As an ethnic Bengali my ancestry comes from the northeast of the Indian subcontinent, so in addition to the “Asian” fraction which most South Asians have in the 23andMe “ancestry painting” (around 25% on average, with a range from 10-35% probably the extremes within two standard deviations from what I can tell), I likely have some southeast Asian ancestry from Burma. 23andMe has three “reference” populations it uses from the HapMap:

Asian = Chinese/Japanese
European = Northwest European
African = Yoruba

All of us get an ancestry painting which is a combination of these three. Unfortunately unless you’re a relatively straightforward combination of these three groups it isn’t always too informative. So if you’re African American you should be in luck since the two ancestral populations which you derive from are included as reference populations. On the other hand, unadmixed Native Americans tend to be about 25% European and 75% Asian, while unadmixed South Asians are 75% European and 25% Asian. That’s because the allele frequencies in these two populations have some relationship to both the reference groups, even if there hasn’t been any recent admixture (additionally, the painting presumably misses a lot that is distinctive to these groups, though 23andMe has a feature which allows people to explore possible Native American ancestry specifically).

As I told you before my ancestry is 57% European and 43% Asian. This is a very large Asian fraction for a South Asian, and after comparing notes with other South Asian 23andMe customers I’m pretty sure that my large fraction is due to having admixture from Burmese and/or Tibeto-Burman or Austro-Asiatic “Hill Tribes” to the north, south and east of Bengal. Since my family is from the east of Bengal that is not too surprising.

You know from my previous post that on the PCA plot I am near, but outside of, the main South Asian cluster. But there’s some interesting data from the gene comparison feature too. For reasons of privacy I’m obviously not going to give you names, but I will label people by geographical origin if I know that aspect of the individual’s information. Additionally, the comparison below is mostly to Indians, so I’m going to substitute the names of Indian states for those where I have that level of specificity. I also restandardized the gene sharing value, so that the nearest individual with whom I’m sharing is 0, and the furthest on the plot is 1 (74.5% to 73.04% if you’re curious). To add a wrinkle, I’ve put the % Asian calculated from 23andMe’s ancestry painting on the Y axis. The two images below show the results; the first includes some East Asians and a European, while the second includes only South Asians.
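The restandardization described above is a min-max rescaling with the direction flipped, so that the highest sharing value becomes 0 and the lowest becomes 1. A minimal sketch follows, assuming sharing values are given as percentages. Only the two endpoints (74.5 and 73.04) come from the post; the middle values and the function name `rescale_sharing` are hypothetical.

```python
def rescale_sharing(values):
    """Map the highest gene sharing value to 0 and the lowest to 1."""
    highest, lowest = max(values), min(values)
    if highest == lowest:
        raise ValueError("need at least two distinct sharing values")
    return [(highest - v) / (highest - lowest) for v in values]

# Endpoints are from the post; the two middle values are made up for illustration.
sharing = [74.5, 74.0, 73.5, 73.04]
distances = rescale_sharing(sharing)

for pct, dist in zip(sharing, distances):
    print(f"{pct:.2f}% shared -> rescaled distance {dist:.3f}")
```

Note how narrow the underlying range is: the whole 0-to-1 axis covers a difference of only about 1.5 percentage points of sharing.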


The first image is of more interest. Two points:

1 – Unlike most South Asians I have greater gene sharing identity with East Asians than with Europeans. The South Asian to whom I am closest does not exhibit my own pattern, as they are closer to some Europeans than they are to some Chinese. In contrast, I not only unequivocally share more genes with East Asians than with Europeans, but I also share more genes with some East Asians than I do with the individual from Iran, with one South Asian from the northwest of the subcontinent, and with another from southern India. This last pattern is very peculiar from what I’ve been told (the other Bangladeshi has the same tendency, though not to the same extent).

2 – One of the people with whom I am sharing genes is a woman from Burma. Her father, who died when she was young, had Indian ancestry, and reputedly spoke Tamil. She is ~20% European, which would make her father ~40% European. I have not seen a South Indian who is less than 65% European, so I believe that he had native Burmese admixture. If his mother was Burmese, that would make his father ~80% European, which I have seen in a few South Indians, though their usual range seems to be 65-75%. Note that I am closer to her than I am to most South Asians. In contrast, the Bangladeshi with whom I am sharing genes, who has the second highest percentage of Asian in their ancestry, is about as far from this woman as he is from the Punjabis in terms of distance (by comparison, the Punjabis are about 2.5 times further from me than she is).
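The arithmetic in the second point follows from treating a child's ancestry fraction as the average of the two parents' fractions. A minimal sketch, assuming that an unadmixed Burmese parent scores 0% European in the ancestry painting (an assumption implied by the figures above, not something 23andMe states); the function name `other_parent_fraction` is hypothetical.

```python
def other_parent_fraction(child_pct, known_parent_pct):
    """Solve child = (known_parent + other_parent) / 2 for the other parent."""
    other = 2 * child_pct - known_parent_pct
    if not 0 <= other <= 100:
        raise ValueError("inputs are inconsistent with simple averaging")
    return other

BURMESE_EUROPEAN_PCT = 0  # assumption: unadmixed Burmese ancestry paints as 0% European

# The woman is ~20% European and her mother is assumed to be Burmese.
father_pct = other_parent_fraction(20, BURMESE_EUROPEAN_PCT)
# If the father's own mother was Burmese, the same step gives his father's value.
grandfather_pct = other_parent_fraction(father_pct, BURMESE_EUROPEAN_PCT)

print(f"father ~{father_pct}% European, paternal grandfather ~{grandfather_pct}% European")
```

Real inheritance is noisier than a clean halving at each generation, so these are rough expectations, which is why the post hedges every figure with "~".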

If I did the same plot of % Asian against gene sharing for the European man and his Eurasian daughter, I would see a noticeable linear pattern for most of the data: the more Asian, the less gene sharing. The exception would be his daughter, who would be substantially Asian but would also be the closest by this genetic distance measure. Similarly, the Burmese woman with some Indian admixture is an outlier on my plot. The South Asians follow a southeast-to-northwest range of distance from me, with a rough, but not perfect, correspondence with Asian ancestry. Among the South Asians the individual from Bihar is an exception, just as the Burmese woman is.

Why? In previous comments I have indicated that there is a high probability of recent Burmese ancestry in my paternal lineage (specifically through my paternal grandfather, whose physical appearance is always described as atypical for a Bengali; my paternal grandmother was from a Hindu family which converted, and she looked stereotypically Bengali). Additionally, I know my mother’s maternal grandfather is from the Indian state of Uttar Pradesh, specifically, the region of Delhi. But I also know that before they were Muslim my maternal grandfather’s family were of the Hindu Kayastha caste. The individual from Bihar is a Kayastha, and for those of you who do not know, Bihar is the state just to the west of Bengal. I do not know whether the Kayasthas share any deep genetic affinity, but I recall that Reich et al. observed a high degree of genetic evidence of endogamy in South Asia. So, just as I believe that I share Burmese-specific genetic variants with the woman of predominantly Burmese origin, variants which are not showing up in the simple ancestry estimates based on the global reference populations, I may also share Kayastha-specific variants which result in my genetic closeness to the Bihari individual. But my confidence in the latter conjecture is far weaker than in the former case.
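One way to make "exception to the linear pattern" concrete is a leave-one-out check: fit a straight line to everyone except one individual, then see how far that individual falls from the line. The sketch below uses entirely hypothetical numbers shaped like the European father and Eurasian daughter case (distance rises with % Asian, except for one close relative), and the function name `leave_one_out_residuals` is made up for illustration. A plain fit over all points would be a poor test here, because a single extreme point drags the line toward itself.

```python
import numpy as np

def leave_one_out_residuals(x, y):
    """For each point, fit a line to the other points and return observed minus predicted."""
    residuals = np.empty(len(x))
    for i in range(len(x)):
        slope, intercept = np.polyfit(np.delete(x, i), np.delete(y, i), 1)
        residuals[i] = y[i] - (slope * x[i] + intercept)
    return residuals

# Hypothetical data: % Asian versus rescaled genetic distance (0 = closest).
# The last individual is highly Asian yet the closest of all, like a child of the subject.
pct_asian = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 80.0])
distance = np.array([0.2, 0.3, 0.4, 0.5, 0.6, 0.0])

loo = leave_one_out_residuals(pct_asian, distance)
outlier_index = int(np.argmax(np.abs(loo)))
print(f"largest departure from the trend: individual {outlier_index}, residual {loo[outlier_index]:.2f}")
```

A large negative residual means the individual is much closer than their % Asian alone would predict, which is the signature of shared family-level or group-specific variants that the three-population painting does not capture.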

In reviewing all I’ve said so far I suppose the moral of the story is not to trust too deeply in one set of data visualizations or summary statistics. Granted, some people have axes to grind and can find what they want in the science; my posts on Jewish genetics indicate that very strongly. But if you’re genuinely interested in patterns of variation, and your own place within the broader framework, you need to open different windows on the same data to get a truly fully fleshed-out understanding of the nature of things. If you are of an understudied population, and of somewhat mixed background, as I am, tread lightly and carefully. If you are of a well studied and characterized population, then learning you are 100% European is basically worthless (though some of the more detailed PCAs can tell you some things).

CATEGORIZED UNDER: Genetics, Genomics

  • deadpost

    It would be interesting if a social science study was done to look at people’s perceptions and this sort of genetic relatedness.

    For instance, if you had photographs of a sample of individuals volunteering for such a study and had a random group of people rate pairwise similarity of their faces on a scale of 1 to 10, or even ask a question like how similar (or ease of “passing for”) a “European”, “African” or “East Asian” does individual X look, on a scale of 1 to 10?

    Might you then see people’s social categorizations match such genetic groupings (though I’m not familiar enough with stats to know how it’d be done)? If no one has done it, it would definitely be interesting.

    On a related note, I know the old-school craniometric groupings of “races” are considered obsolete, but have you seen any studies that have actually used morphological-type data (e.g. nasal breadth, head width, etc.) as variables for a multivariate analysis like PCA, compared the result with genetic maps, and looked at the similarity? It’d be insightful too to look at where “looks can be deceiving”. An analogue would be those phylogenetic trees where zoologists compare morphological with molecular characters.

    Sorry if my comments are a bit wordy.

  • onur

    I think we should see more samples from the Indo-Aryan-speaking populations of Bangladesh and the surrounding Indian territories (including the northeasternmost Indian territories) as all of your known ancestors were Indo-Aryan-speaking (Hindi in your mother’s maternal grandfather’s known [not putative] ancestors and Bengali in all of the rest of your known ancestors).

  • Razib Khan

    in The Rise of Islam and the Bengal Frontier, 1204-1760 the author argues that the high % of muslims in eastern bengal is largely due to the relatively late settlement by indian civilization; specifically, the dominant group was muslim by this period. instead of being ‘sanskritized’ as further west, the native tribal groups were ‘islamicized.’ the model here is basically similar to northeast india, where the british fostered christianity among hill peoples who were not yet under the influence of hindu or muslim indian civilization. similarly, in eastern bengal the argument goes that islam was powerful precisely because hinduism was very weak or non-existent among the local peoples (bengal was one of the last regions of south asia where buddhism flourished as well).

    here’s the rub for genetics: who were the tribal people in eastern bengal? it may be a substantial proportion were austro-asiatic types who were the substratum across much of southeast asia. so one need not posit recent ancestry:

  • onur

    We should also see more samples from all sorts of non-Indo-Aryan-speaking populations of Bangladesh, northeastern parts of India (including those west and southwest of Bangladesh) and even various regions of Burma (Myanmar) and Tibet.

  • Razib Khan

    page 19. two of the groups you see deviated toward the chinese are austro-asiatic speaking tribes. one of whom, the santhal, have a presence in bangladesh and bengal as a whole. it seems likely that a lot of bengalis are indo-europeanized santhals.


