PCA, Razib around the world (a little)

By Razib Khan | August 10, 2010 2:32 pm

I have put up a few posts warning readers to be careful of confusing PCA plots with real genetic variation. PCA plots are just ways to capture variation in large data sets and extract out the independent dimensions. It’s great at detecting population substructure because the largest components of variation often track between population differences, which consist of sets of correlated allele frequencies. Remember that PCA plots usually are constructed from the two largest dimensions of variation, so they will be drawn from just these correlated allele frequency differences between populations which emerge from historical separation and evolutionary events. Observe that African Americans are distributed along an axis between Europeans and West Africans. Since we know that these are the two parental populations this makes total sense; the between population differences (e.g., SLC24A5 and Duffy) are the raw material from which independent dimensions can pop out. But on a finer scale one has to be cautious because the distribution of elements on the plot as a function of principal components is sensitive to the variation you input to generate the dimensions in the first place.
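A toy simulation can make the admixture axis concrete. The sketch below is only an illustration under stated assumptions: the allele frequencies are made up, the genotypes are simulated rather than taken from any real data set, and the helper name `pca_coords` is hypothetical. It centers each SNP and scales it by sqrt(p(1-p)), which is one common normalization for genotype PCA and not necessarily what any particular company or paper uses.

```python
import numpy as np

def pca_coords(genotypes, n_components=2):
    """Return principal-component coordinates for each individual (row).

    genotypes: (n_individuals, n_snps) array of 0/1/2 allele counts.
    """
    g = np.asarray(genotypes, dtype=float)
    p = g.mean(axis=0) / 2.0                # sample allele frequency per SNP
    keep = (p > 0) & (p < 1)                # monomorphic SNPs carry no signal
    g, p = g[:, keep], p[keep]
    z = (g - 2 * p) / np.sqrt(p * (1 - p))  # center and scale each SNP
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

rng = np.random.default_rng(0)
n_snps, n_per_group = 2000, 50

# Made-up allele frequencies for two "parental" populations (assumption).
freq_a = rng.uniform(0.05, 0.95, n_snps)
freq_b = np.clip(freq_a + rng.normal(0, 0.2, n_snps), 0.02, 0.98)

pop_a = rng.binomial(2, np.tile(freq_a, (n_per_group, 1)))
pop_b = rng.binomial(2, np.tile(freq_b, (n_per_group, 1)))

# Admixed individuals: each has a fraction `ancestry` of population A.
ancestry = rng.uniform(0.2, 0.9, n_per_group)
mix_freq = ancestry[:, None] * freq_a + (1 - ancestry[:, None]) * freq_b
admixed = rng.binomial(2, mix_freq)

coords = pca_coords(np.vstack([pop_a, pop_b, admixed]))
pc1 = coords[:, 0]
mean_a = pc1[:n_per_group].mean()
mean_b = pc1[n_per_group:2 * n_per_group].mean()
pc1_admixed = pc1[2 * n_per_group:]

print("PC1 means (A, B, admixed):", mean_a, mean_b, pc1_admixed.mean())
print("corr(ancestry, PC1) among admixed:",
      np.corrcoef(ancestry, pc1_admixed)[0, 1])
```

In this toy setup the admixed individuals fall between the two parental clusters on the first component, and their position along it tracks their simulated ancestry fraction. The sign of a component is arbitrary, so only relative positions are meaningful.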

I can give you a concrete example: me. I showed you my 23andMe ancestry painting yesterday. I didn’t show you my position on the HGDP data set because I’ve shared genes with others and I don’t want to take the step of displaying other peoples’ genetic data, even if at a remove. But, I have reedited some “demo” screenshots and placed where I am on the plot to illustrate what I’m talking about above. The first shot is my position on the two-dimensional plot of first and second principal components of genetic variation from the HGDP data set.


No surprise that I’m in the Central/South Asian cluster. But what may surprise you is that I’m not in the South Asian cluster, I’m in the Central Asian cluster. In the Central Asian cluster are Uyghurs and Hazaras. These are two hybrid populations, a mixture of West and East Eurasian elements. The Uyghurs are likely the outcome of a process of admixture between the Iranian and Tocharian Indo-European populations of the cities of the Tarim basin, and later Turkic-speaking settlers who arrived in the wake of the expansion and later collapse of the first Uyghur Empire (the historical connection between the current Uyghurs and ancient Uyghurs is tenuous at best, and complicated). The Hazaras are a more recent population, likely emerging as the product of intermarriages between Mongol soldiers who arrived in the 13th century, and indigenous women, Persians, Turks, and assorted Indo-Iranian groups between the Zagros and Khyber Pass. It is somewhat ironic that I’m on the edge of the Hazara cluster since they are almost certainly in part descended from Genghis Khan’s family, and my own surname is Khan. But I know that my Y chromosomal lineage is R1a1, very common across Central and Southern Eurasia, and not a Mongolian one at all.

Zoom! Now we’ve constrained the input data set to the Central/South Asian groups. First, look at the Kalash. They’re strange, which is no surprise, they’re an inbred mountain group in Pakistan who have not adopted Islam. The Pakistani Taliban looks to be ending them as we speak. I really would prefer that they were just thrown out of the data set for this zoom view, because on this fine grained scale I don’t think they add much at all. They’re just an example of what long term endogamy can do to your allele frequencies. The bigger picture is the axis between the populations of Pakistan, and those of Central Asia. Observe that I’ve changed position. Whereas when taking worldwide genetic variation into account I clustered with Central Asians, now I’m 2/3 of the way to the South Asian cluster. I will tell you that I’ve shared “genes” with around 50 South Asians now, from various parts of the subcontinent, and in the 23andMe plot they overlay the South Asians nearly perfectly. I’ve put labels at the approximate ethno-linguistic position. I’m an outlier. 23andMe tells me that I’m 43% “East Asian.” The typical South Asian is in the 10-30% range. My first assumption was that I have a lot of ancient South Indian, which just shows up as East Asian in their algorithm. With this in mind I tried sharing with a lot of South and East Indians, and found out two interesting points. First, South Indians seem no higher than 30-35% East Asian. Bengalis on the other hand are more East Asian, with Bangladeshis more East Asian than West Bengalis. My sample size for Bengalis is small, so take that with caution. Second, the PCA plots put the South Indians firmly in the South Asian cluster, but the Bengalis trail out toward my own position. This indicates again that different methods are telling you slightly different things. The PCA is only a thin slice of variation, but it’s highly informative of between population differences. A Bengali and a South Indian with the same “East Asian” fraction in the ancestry painting nevertheless have consistently different positions on the PCA, with Bengalis closer to the East Asians. Additionally, there’s an ethnic Persian in this zoom plot that I’m describing, and they are positioned near the Balochi. But on the worldwide plot they’re on the margins of the European cluster. Another illustration that the position of an element is sensitive to the input data because of how the dimensions are generated.
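The dependence on input data can be sketched with simulated genotypes. Everything below is an assumption for illustration: four hypothetical populations labeled A through D with made-up allele frequencies, where A and B are strongly diverged and D differs only modestly from C. The helper names `pca_coords` and `separation` are invented for this sketch. The point is only to compare how well C and D separate on the first component when the PCA is computed on all four groups versus on C and D alone.

```python
import numpy as np

def pca_coords(genotypes, n_components=2):
    """PC coordinates per individual from a 0/1/2 genotype matrix."""
    g = np.asarray(genotypes, dtype=float)
    p = g.mean(axis=0) / 2.0
    keep = (p > 0) & (p < 1)                # drop monomorphic SNPs
    g, p = g[:, keep], p[keep]
    z = (g - 2 * p) / np.sqrt(p * (1 - p))  # center and scale each SNP
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

def separation(x, y):
    """Gap between two group means in units of pooled within-group SD."""
    pooled_sd = np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2)
    return abs(x.mean() - y.mean()) / pooled_sd

rng = np.random.default_rng(1)
n_snps, n = 3000, 50
base = rng.uniform(0.3, 0.7, n_snps)        # made-up baseline frequencies

def drifted(sd):
    """Baseline frequencies perturbed by independent noise of size sd."""
    return np.clip(base + rng.normal(0, sd, n_snps), 0.02, 0.98)

# A and B are strongly diverged; D differs only modestly from C.
freqs = {"A": drifted(0.2), "B": drifted(0.2), "C": base, "D": drifted(0.1)}
geno = {name: rng.binomial(2, np.tile(f, (n, 1))) for name, f in freqs.items()}

# "World" view: all four groups define the components.
world = pca_coords(np.vstack([geno[k] for k in "ABCD"]))
sep_world = separation(world[2 * n:3 * n, 0], world[3 * n:, 0])

# "Zoom" view: only C and D are used to build the components.
zoom = pca_coords(np.vstack([geno["C"], geno["D"]]))
sep_zoom = separation(zoom[:n, 0], zoom[n:, 0])

print("C vs D separation on PC1, world input:", sep_world)
print("C vs D separation on PC1, zoom input: ", sep_zoom)
```

Under these assumptions the first component of the four-group analysis is taken up by the A versus B contrast, so C and D barely separate on it, while the same C and D individuals define the first component once they are the only input. The individuals have not changed, only the set of samples used to generate the dimensions.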

Blaine Bettinger, who inspired me to post this, told a story with his ancestry painting which was plausible. What can I say? First, I have less than 1% African ancestry. This could be noise. But, I do observe that the South Asians with Muslim names are enriched in the set of those who I’ve shared genes with and who have less than 1%, but not 0%, African ancestry. Just as Muslim South Asians have non-trivial West Asian ancestry, I suspect that many of us have Sub-Saharan African ancestry through the same dynamic. Sub-Saharan African soldiers were prominent across South Asia with the arrival of Muslims. Bengal even has a period of rule by Abyssinian rulers. But the bigger issue for me is the East Asian component. Here is a figure from a paper published 4 years ago:

journal.pgen.0020215.g005

The figure is showing Fst value comparing Indian Americans with Europeans and East Asians. Fst measures between population differences in allele frequency, in this case the alleles being 207 indels. Take a look at the Bengalis. These are West Bengalis, who I believe have a lesser East Asian component, but even there the allele frequency difference to East Asians is near that of Europeans. The Assamese, who speak a language very close to Bengali, are similar. Assam was ruled by a Tibeto-Burman people for nearly 600 years. The Oriya speakers, from the southwest of Bengal, are more distant from East Asians. As one goes south and east, and west and north, the distance from East Asians increases. This shouldn’t be that surprising, but nice to confirm. The fact that the genetic distance increases as one goes south means that for northeast South Asia you need to complexify the model from a two-way admixture with “ancient North Indians” and “ancient South Indians.” Set next to these two is an East Asian element, which is also clear in the Indo-Aryan peoples of Nepal.
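For readers who want to see what an Fst number summarizes, here is a minimal sketch of one simple textbook estimator for two equally weighted populations at biallelic loci: Fst = (H_T - H_S) / H_T, where H_S is the average within-population expected heterozygosity and H_T is the expected heterozygosity at the pooled allele frequency. This is not necessarily the estimator used in the paper, the function name `fst` is hypothetical, and the allele frequencies below are invented purely for illustration.

```python
import numpy as np

def fst(p1, p2):
    """Simple two-population Fst from per-locus allele frequencies.

    Uses a ratio of sums across loci: (sum H_T - sum H_S) / sum H_T.
    Assumes equal population weights and at least one polymorphic locus.
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    p_bar = (p1 + p2) / 2
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # within populations
    h_t = 2 * p_bar * (1 - p_bar)                       # pooled
    return (h_t.sum() - h_s.sum()) / h_t.sum()

# Invented frequencies at three loci for a focal group and two references.
focal = [0.30, 0.55, 0.70]
ref_1 = [0.20, 0.60, 0.90]
ref_2 = [0.50, 0.40, 0.45]

print("Fst(focal, ref_1):", fst(focal, ref_1))
print("Fst(focal, ref_2):", fst(focal, ref_2))
```

Identical frequencies give 0 and fixed differences give 1; comparing one group against two references, as the figure does, amounts to asking which reference yields the smaller value.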

Of course anyone who knows Bengalis won’t be totally surprised by an East Asian component to their ancestry. To the left are head shots of the two women who have dominated Bangladeshi politics for the past two decades, Khaleda Zia and Sheikh Hasina. They’re both Bengalis, but they do look different, and I know many people who look like one or the other (or a combination). My family is from one of the easternmost districts of Bengal, next to Tripura. In fact my late maternal grandmother lived in Tripura for some of her childhood (she was almost trampled to death by the Maharani of Tripura’s insane elephant as a young girl!). When I was a young child I once saw a black and white photo from my father’s college days, and I was curious who the Asiatic looking young man in the middle of the photograph was. Turns out it was my father! Sometimes our expectations affect how we perceive people. I have never perceived my father to have an Asian cast to his features as a more mature man, but others have told me that he does still exhibit them.

There is still the question of how Bengalis came to have this particular admixture. I think the most plausible scenario probably synthesizes conventional village-to-village intermarriage and isolation-by-distance, along with some component of migrationism. Tribes such as the Chakma have left Burma in historical time. The Chakma of Bangladesh now speak a dialect of Bengali, not their ancestral Sino-Tibetan tongue. I believe that a non-trivial portion of Bengalis have ancestors who were tribal people who shifted their religious identity to that of Hinduism or Islam (from Theravada Buddhism in the case of the Chakma, or animism in the case of the Garos before their Christianization). But eastern South Asia is adjacent to mainland Southeast Asia, and it stands to reason that continuous gene flow would over time also have introduced East Asian alleles into the Bengali gene pool.

Image Credit: TopNews.in

CATEGORIZED UNDER: Genetics, Genomics

  • manju

    It is somewhat ironic that I’m on the edge of the Hazara cluster since they are almost certainly in part descended from Genghis Khan’s family, and my own surname is Khan. But I know that my Y chromosomal lineage is R1a1, very common across Central and Southern Eurasia, and not a Mongolian one at all.

    By Genghis Khan’s time, R1a1 was quintessentially Mongolian too. Wasn’t it?

    May I know your mtDNA haplogroup?

  • http://blogs.discovermagazine.com/gnxp Razib Khan

    By Genghis Khan’s time, R1a1 was quintessentially Mongolian too. Wasn’t it?

    no, only a small minority. most mongolians are:
    http://en.wikipedia.org/wiki/Haplogroup_C_(Y-DNA)

    mtDNA is U2b. my uniparental lineages are totally conventional.

  • manju

    Thanks. Any idea about Bangladeshi mtDNA breakdown (between South Asia specific and East Asia specific)?

  • http://blogs.discovermagazine.com/gnxp Razib Khan

    bangladesh has the highest frequency of M in the world
    http://en.wikipedia.org/wiki/Haplogroup_M_(mtDNA)#Asian_origin_Theory

    all the other bengalis i’m sharing genes with are of M.

  • bat

    Thank you, Mr Khan! Your posts and discussions are very interesting for us Mongolians. We Mongolians somehow do not pay much attention to these things. I do not know why. Perhaps we are so preoccupied with our lives, or we take things for granted and think “well, it makes sense because we Mongols ruled half of the known world”, or something.

    Keep writing, Mr Khan!

  • SV

    If I am interpreting that Fst chart correctly, the Tamils have a smaller allele frequency difference with West Eurasians than Gujaratis do with West Eurasians. I would not have expected that, geographically.

  • onur

    I’ve begun to wonder whether Hazaras and Uyghurs are the same people. I am serious! Because in every genetic study involving them, they appear genetically virtually indistinguishable from each other, and not just in PCA plots, but also in more serious and detailed genetic analyses like STRUCTURE, ADMIXTURE and frappe. Also physically they look very similar to each other.

    We know that Turkic peoples and Mongolians had connections historically in various times and places (the nature of their linguistic, cultural and genetic relationship isn’t clear though). Also we know that armies of Chinggis and his descendants included various Turkic peoples in addition to Mongolians and there were all sorts of civilians accompanying them from various ethnicities. Uyghurs (a Turkic people) had a special place in Turkic and Mongolian realms as they were a largely settled and civilized people and carriers of a high civilization to the otherwise nomadic and uncivilized Turkic and Mongolian societies of Central Asia (including Mongolia) and we know that Uyghurs had a very significant cultural influence on Mongolians beginning from pre-Chinggisid times. So Hazaras’ Mongolian connection may have something to do with Uyghurs as well (directly or indirectly). But there are so many mysteries in relevant histories that there is still much ambiguity.

    A similar genetic similarity can be seen between Tajiks and Uzbeks, but they are geographically very close and we know that they are historically and culturally very related, so that isn’t surprising. But the genetic similarity between Uyghurs and Hazaras is more of a surprise because of the geographical distance and the ambiguity of the historical connection (at least compared to that between Uzbeks and Tajiks).

  • http://blogs.discovermagazine.com/gnxp Razib Khan

    bat, please don’t insult other commenters.

  • http://blogs.discovermagazine.com/gnxp Razib Khan

    onur, can you point to the structure plots you’re thinking about? also, do you know if the uyghurs carry the khan Y chromosomal lineage as the hazaras do?

  • onur

    Bat, I used the word “civilization” in its narrow meaning, which doesn’t include nomadic cultures, so I didn’t use the word “uncivilized” as an insult. BTW, I am Mr (male).

  • onur

    onur, can you point to the structure plots you’re thinking about?

    They are so many (in fact, all STRUCTURE and similar analyses I have seen involving the two populations) that I cannot list them all. But the last one that you assessed on your blog is particularly a clear example:

    http://download.cell.com/AJHG/mmcs/journals/0002-9297/PIIS0002929710003642.mmc1.pdf

    also, do you know if the uyghurs carry the khan Y chromosomal lineage as the hazaras do?

    Good question. Your thread about Chinggis Khan has a map showing the proportions of the star-cluster chromosomes that are thought to be related to the Chinggisid family in various Eurasian populations, and luckily, it includes both Hazaras and Uyghurs:

    http://blogs.discovermagazine.com/gnxp/2010/08/1-in-200-men-direct-descendants-of-genghis-khan/

    As you see, Hazaras have more of it than Uyghurs, but that may be related to genetic drift due to Hazaras being probably a very small population for most of their history. Mongolians proper have almost the same proportion of the “Chinggisid” star-cluster chromosomes as Kazakhstani Uyghurs and Chinese Kazakhs, so genetic drift is probably the best explanation for its particular abundance in Hazaras. Strangely, on the same map Chinese Uyghurs show no “Chinggisid” star-cluster, while Kazakhstani Uyghurs have it in abundance; that is probably an error as even some Han groups have some “Chinggisid” star-cluster.

  • http://blogs.discovermagazine.com/gnxp Razib Khan

    Strangely, on the same map Chinese Uyghurs show no “Chinggisid” star-cluster, while Kazakhstani Uyghurs have it in abundance; that is probably an error as even some Han groups have some “Chinggisid” star-cluster.

    i don’t want to open up this can of worms in detail, because i weary of deleting/banning commenters with strong nationalistic inclinations. but a lot of the turkic national groups of the former soviet union and east turkestan of the 19th and 20th century are relatively recent constructions from what i have read. e.g., “uzbek” was a specific term for the ruling dynasty/caste, while the commoner turks were termed “sart.”

    http://en.wikipedia.org/wiki/Sart

    i can accept that the hazaras have many uyghur ancestors among their male line. though i think it is clear that their maternal lineage would be slightly different. additionally, the balance between “eastern” and “western” differs somewhat. i think their parent populations are genetically analogous to each other so it might not work, but finding PC3 and PC4 might resolve this.

  • onur

    i can accept that the hazaras have many uyghur ancestors among their male line. though i think it is clear that their maternal lineage would be slightly different. additionally, the balance between “eastern” and “western” differs somewhat. i think their parent populations are genetically analogous to each other so it might not work, but finding PC3 and PC4 might resolve this.

    I am not a person receptive of “anomalies”, so my thoughts are similar to yours on this issue.

  • onur

    i don’t want to open up this can of worms in detail, because i weary of deleting/banning commenters with strong nationalistic inclinations. but a lot of the turkic national groups of the former soviet union and east turkestan of the 19th and 20th century are relatively recent constructions from what i have read. e.g., “uzbek” was a specific term for the ruling dynasty/caste, while the commoner turks were termed “sart.”

    A lot of modern identities are recently (19th to 20th centuries; for some Western countries, late 18th century) constructed (as a result of the “nationalism” disease and changing borders). The situation of the “Turkish” identity of modern Turks isn’t so different from the “Uzbek” identity of modern Uzbeks.

  • toto

    Bengalis on the other hand are more East Asian, with Bangladeshis more East Asian than West Bengalis.

    *looks at map*

    Well, duh! :)

    More seriously, it’s not quite as obvious as it seems – apparently Kashmiri Pandits don’t seem to share much with their “yellow” neighbours.

  • djw

    It looks like the Kalash “use up” one of the two dimensions available for the presentation of the data in that PCA plot with their internal variation. If they were removed (or the PCA axis rotated so that Kalash internal variation is not shown) then I bet you could see some more interesting structure in the rest of the populations.

    I noticed the same problem with Orcadians in the 23andme northern european PCA plot. The Orcadians form a long thin line and the rest of the northern european population forms a somewhat thicker line perpendicular to the Orcadians.

  • djw

    I am wondering why the Kalash have such a large impact on the PCA plot. My naive expectation is that a small isolated population would have less genetic diversity than the larger populations in the plot and would appear as a small cluster a considerable distance from the rest.

    The best explanation that I have been able to come up with is that recent immigration into the Kalash adds a more typical south asian component to their population, and the Kalash now cover the continuum from unadmixed to highly admixed. If we fast forward 100 years and take a pca snapshot every ten years would we see the line of Kalash grow towards the rest of the south asian populations and then get absorbed into it? Or does my model miss something important?

  • http://blogs.discovermagazine.com/gnxp Razib Khan

    djw,

    interesting points.

    1) from what i know the kalash leave, but no one joins them. they’ve been isolated since partisans of the religion of peace forcibly converted their cultural kin across the border in afghanistan circa 1900. but before 1900 there may have been more fluidity

    2) it could be that there’s clan substructure in the kalash, that they started out more like their neighbors (i think this is true) and some inbred clans have shifted more than others from initial allele frequency

  • onur

    I wonder how the same samples would show up in a STRUCTURE-like analysis. For instance, the Kalash may show some degree of heterogeneity somewhat similar to that among African Americans looking at the above PCA plot involving the Kalash.

  • http://blogs.discovermagazine.com/gnxp Razib Khan

    onur, it’s in the supplements to the aboriginal paper
    http://download.cell.com/AJHG/mmcs/journals/0002-9297/PIIS0002929710003642.mmc1.pdf

    compare structure vs. frappe. i believe structure is more distorted by linkage disequilibrium, so that may explain the difference.

  • onur

    Razib, I had already looked at those analyses; what I really wonder is how the Kalash would show up in a STRUCTURE-like plot involving only the exact same populations seen on the above PCA plot that involves the Kalash.

    Anyway, on the STRUCTURE-like analyses and PCA plots involving the Kalash I’ve seen so far, the Kalash don’t show up more heterogeneous than the other Central and South Asian populations, but the reverse may be true. So the above PCA plot seems somewhat anomalous, but I think it has other explanations.

  • http://blogs.discovermagazine.com/gnxp Razib Khan

    here’s a thought. the x-y on that zoom plot is mostly constructed by the largest components of variation within the non-kalash groups. they’re numerically preponderant, and if the kalash are genetic isolates, which we can adduce from other sources, then the x-y axes may not capture much of their genetic variation. rather, what it’s capturing is the component of the kalash variance which is shared with the other south asian groups, which varies from kalash to kalash.

  • onur

    Yes, I was thinking more or less along the same lines. If PC3 and beyond were examined instead of PC1 and PC2, the Kalash genetic variation could have been captured to a high degree and they would probably show up genetically pretty homogeneous then.

  • djw

    Decodeme has a 3d PCA plot that allows you to select 3 out of the top six axes of variation to look at. You can drag the axes around and look at them from any angle you please, but I find that makes it more confusing.

    In any case, when I look at pca1 vs pca2 the decodeme plot looks pretty much like the one that you posted. pca1 vs pca3 has the Kalash cluster with the other pakistan groups in a continuum along pca3 from kalash to burusho. The Uygur and Hazara are outliers separated from the rest along pca1. They still overlay each other.

    I think this confirms your guess in post 23, but I don’t have enough real experience to say more. You might be able to create a free account there and look at this. Alternatively, I can try to figure out how to take a screen shot.

    By the way, do you know of a paper I can look at to see how these pca plots are defined? My background is in physics so I am comfortable with the idea of eigenvectors, but I have not quite been able to deduce how to build the vectors out of snp raw data.

  • http://blogs.discovermagazine.com/gnxp Razib Khan

    the pca is always done in matlab. you can get a tarball of the hgdp data set of snps from stanford’s website i think. i’m going to look into the details myself in the near future, so perhaps i’ll post on it.

  • p-ter
  • djw

    Thanks for the link p-ter, that paper really helped.

    I wonder if the “stringy” look of the Kalash in this plot is actually an artifact of the normalization. Playing around with the decode me kinship map most populations appear “stringy” when I plot a high variation axis against a lower variation axis. I suspect that they normalize the low variation axis to the high variation axis with the result that the populations seemed stretched along the low variation axis.

    When I plot pca2 vs pca3 the variation of the kalash population along pca2 looks much more compact.

  • Antonio

    I got to this post late, but I have a quick question. How reliable is the information provided by companies like 23andme? I know that the results can change from one company to another, based on the methodology and data sets used. But by how much? I have been reading about these tests for a while and I am considering taking one myself. Yet I’m totally unsure about which credible intervals I should assign to the results. Thank you,


Gene Expression

This blog is about evolution, genetics, genomics and their interstices. Please beware that comments are aggressively moderated. Uncivil or churlish comments will likely get you banned immediately, so make any contribution count!

About Razib Khan

I have degrees in biology and biochemistry, a passion for genetics, history, and philosophy, and shrimp is my favorite food. In relation to nationality I'm an American Northwesterner, in politics I'm a reactionary, and as for religion I have none (I'm an atheist). If you want to know more, see the links at http://www.razib.com
