Tag: Admixture

The Cape Coloureds are a mix of everything

By Razib Khan | June 16, 2011 12:34 am

A Cape Coloured family

I’ve mentioned the Cape Coloureds of South Africa on this weblog before. Culturally they’re Afrikaans in language and Dutch Reformed in religion (the possibly related Cape Malay group is Muslim, though also traditionally Afrikaans-speaking). But racially they’re a very diverse lot. In this way they can be analogized to black Americans, who are ~75% West African and ~25% Northern European, with the variance in ancestral proportions being such that ~10% are ~50% or more European in ancestry. The Cape Coloureds though are much more complex. Some of their ancestry is almost certainly Bantu African. This element is related to the West African affinities of black Americans. And, they have a Northern European element, which likely came in via the Dutch, German, and Huguenot settlers (mostly males). But the Cape Coloureds also have other contributions to their genetic heritage. Firstly, they have Khoisan ancestry, whether from Bushmen or Khoi. This is well known in their oral memory. The hinterlands of the Cape of Good Hope are beyond the ecological range of the Bantu agricultural toolkit, so the region was still dominated by the Khoisan when the Europeans arrived. But there are also other suggestions of ancestry from Asia. The existence of the Cape Malays, whose adherence to Islam derives from the Muslim slaves brought by the Dutch, hints at likely relationships to the populations of maritime Southeast Asia. Finally, there are the Indians. This element is not too well recalled in cultural memory. But the Dutch brought many slaves from India as well as Southeast Asia. The first Dutch governor of the Cape Colony had a maternal grandmother who was an Indian slave, by various accounts Goan or Bengali (the town of Stellenbosch is named for him).
No doubt it was far more likely that the usual lot of the descendants of Indian slaves during the Dutch era would be to be absorbed into the melange of the Coloured population than assimilated into what later became the Afrikaners.

Why is this aspect of Cape Coloured ancestry forgotten? I think part of the reason is that there is a large South African Indian community present today, but that community post-dates the Dutch period, and arrived with the British. When South Africans think of Indians they think of these people. Interestingly when the new genetic studies confirming Indian ancestry came on the scene I was “corrected” several times by Indians themselves when reporting this part of the Coloured heritage. They were under the impression I must be mistaken, as no one was familiar with the Cape Coloureds having Indian ancestry. Unfortunately pointing to PCA and STRUCTURE plots did not clear up the confusion.

In any case, thanks to the African Ancestry Project I now have three unrelated Coloured samples (I have more, but they are related). Since AAP is Afrocentric I thought it would be appropriate to run the Coloured samples separate first. So that’s what I did.

Read More

Flavors of Afro-Asiatic

By Razib Khan | June 10, 2011 4:48 pm

In the post yesterday I reported what was generally known about the Horn of Africa, that its populations seem to lie between those of Sub-Saharan African and Eurasia genetically. This is totally reasonable as a function of geography, but there are also suggestions that this is not simply a function of isolation by distance (i.e., populations at position 0.5 on the interval 0.0 to 1.0 would presumably exhibit equal affinities in both directions due to gene flow). For example, you observe the almost total lack of “Bantu” genetic influence on the Semitic and Cushitic populations of the Horn of Africa, and the lack of Eurasian influence in groups to the south and west of the Horn except to some extent the Masai.

Tacking horizontally in terms of discipline, over the past few generations there has been a veritable cottage industry making the case for the recent origin of many ethno-linguistic populations through a process of cultural self-creation. Clearly there are many cases of this, some of them studied in depth by anthropologists (e.g., the shift from Dinka to Nuer identity). But there has been an unfortunate tendency to over-generalize in this direction. In some ways this is peculiar insofar as these models presuppose the infinite plasticity of culture without observing the sharp and strong norms which those very same phenomena can enforce. The genetic isolation of non-Muslims in the Middle East after the rise of Islam seems rather well validated by the evidence from genomics. The norms of both Muslims and non-Muslims strongly biased them toward endogamy, and the nature of Islamic hegemony and domination was such that Muslims were the ones who were likely to have cosmopolitan affinities with the “Islamic international.” In contrast, non-Muslim minorities began a long process of involution after the Islamic Arab conquests, only disrupted in the past century by emigration and to a lesser extent emancipation.

So back to the Horn of Africa. The vast majority of the people of the Horn of Africa speak an Afro-Asiatic language. Arabic and Hebrew are the most famous members of this group, but it is a very broad classification, ranging from the dialects of the Berbers in the Maghreb all the way to ancient Akkadian. There are two large subfamilies of particular note and interest here: Semitic and Cushitic. The map above shows the distribution within the Horn of Africa. One can “quick & dirty” summarize the pattern here by observing that Semitic languages in Ethiopia tend to be concentrated in the north-central Christian highlands, while Cushitic is found everywhere else. Additionally, there is the confluence between religion and ethnicity, as there are Cushitic Muslims (Somalis, Afar, etc.) and Cushitic Christians (many Oromo, etc.). From what I can gather many Cushitic social and political elites have had a tendency toward assimilating into an Amhara Semitic identity (Haile Selassie’s mother was a Muslim Oromo). We could therefore generate a possible model where Semitic languages arrived late to Ethiopia and spread through elite emulation, so the difference between Semitic and Cushitic peoples should be marginal in the genomic dimension (such as the marginal differences between Hausa and Yoruba in Nigeria). Or, we could posit that the Semitic element is distinctive from a pre-existent Cushitic substratum.

To make a long story short by running more ADMIXTURE with a Horn of Africa centered data set I have discerned that one can actually differentiate Cushitic and Semitic elements in the Horn and tentatively identify them with different ancestral components. First, the technical details….

Read More

CATEGORIZED UNDER: Anthropology, Genomics

A genomic sketch of the Horn of Africa

By Razib Khan | June 9, 2011 7:00 pm

Iman, a Somali model

Since I started up the African Ancestry Project one of the primary sources of interest has been from individuals whose families hail from Northeast Africa. More specifically, the Horn of Africa: Ethiopia, Eritrea, and Somalia. The problem seems to be that 23andMe’s “ancestry painting” algorithm uses West African Yoruba as a reference population, and East Africans are often not well modeled as derivative of West Africans. So, for example, the Nubian individual who I’ve analyzed supposedly comes up to be well over 50% “European” in ancestry painting. Then again, I’m 55-60% “European” as well according to that method! So we shouldn’t take these judgments to heart too much. Obviously something was off, and thanks to Genome Bloggers like Dienekes Pontikos we know what the problem was: the populations of the Horn of Africa have almost no distinctive “Bantu” element to connect them with West Africans like the Yoruba. Additionally, a closer inspection shows that the “Eurasian” component present in these populations is very specific as well, almost totally derived from Arabian-like sources. When breaking apart the West Eurasian populations it is no surprise that Northern Europeans and Arabians are among the most distant pairs, even excluding recent Sub-Saharan African admixture. The HapMap Utah European American sample and the Nigerian Yoruba are very suboptimal for people with eastern African background. In contrast, African Americans are a mixture of West Africans and Northern Europeans, so the ancestry painting algorithm has nearly perfect reference populations for them. The results for African Americans may not be very detailed and rich, but they’re probably pretty accurate at the level of grain which they’re offering results.

Though I’m happy to give people of Northeast African ancestry more detailed results than 23andMe, one of my motivations for the African Ancestry Project was to obtain a data set which would allow me to explore the genomic variation in the east of Africa myself. This region is a strong candidate for “source” populations for non-Africans within the last 100,000 years, and it seems to have experienced rapid population turnover within the last 2,000-3,000 years. My data set is not particularly adequate to my ambitions, yet. But I do now have 5 unrelated Somalis. To my knowledge there hasn’t been much exploration of Somali genomics using thick-marker SNP chips, so why not? N = 5 is better than N = 0 in these cases of extreme undersampling.

Before I proceed to methods and results, I want to note that I put up most of my files here. It’s a ~25 MB compressed folder with images, spreadsheets, as well as raw output from ADMIXTURE and EIGENSOFT. I hope readers will take this as an invitation to poke around themselves.

Read More

Zombie genome blogging

By Razib Khan | May 30, 2011 12:30 pm

Happy Memorial Day, if you’re American!

Dienekes has some very interesting posts up over at Dodecad, How to create Zombies from ADMIXTURE etc., and More Zombies: Ancestral North Indians and Ancestral South Indians reborn. If you are playing with ADMIXTURE these are going to be very useful in the future.

MORE ABOUT: Admixture, Zombies

ADMIXTURE, African Ancestry Project, and confirmation bias

By Razib Khan | May 2, 2011 1:29 pm

I’ve been running the African Ancestry Project for a while now on the side on Facebook. But it’s getting unwieldy, so I finally set up the website. The main reason I started it up is that there have been complaints for a while now of problems with the 23andMe “ancestry painting” and such for some African groups. For example, a Nubian might be 70% “European.” One might argue that this is due to Arab admixture, but this is clearly not so if you look at the PCA plot. What’s going on? Probably a problem with the reference populations (only Yoruba for Africa), ascertainment bias in the chip (they’re tuned to European variation), and the fact that African genetic variance can cause some issues. I don’t know. But the problem has been persistent, and since most of the other genome blogging projects exclude Africans because they’re so genetically diverse I decided to take it on.

Three groups of people have submitted:

– People of the African Diaspora in the New World

– People from Africa, disproportionately Northeast Africans (Horn of Africa + Nubia, etc.)

– People of some suspected or known minor component of African ancestry

I’m at ~70 participants now. As one reference population set I’ve been using a subset of Henn et al. as well as some populations from Behar et al. I call this my “thin” set since there are only ~40,000 SNPs. A “thick” set has on the order of 300-400 thousand markers, but fewer populations. I’ve been putting the AAP members through ADMIXTURE in batches of 10, but I also run them all together sometimes for apples-to-apples comparisons. Yesterday I ran AF001 to AF070 from K = 2 to K = 14, unsupervised, with the thin reference. If you want to see all the results, go here. Doing all this myself over and over has given me some intuition as to the pitfalls in this sort of analysis, especially in the area of confirmation bias.
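A sweep like the one described above (K = 2 to K = 14, unsupervised) comes down to one ADMIXTURE invocation per value of K. Here is a minimal sketch in Python that only builds the command lines. The file name `aap_thin.bed` is hypothetical, and the optional `--cv` flag (cross-validation error) is an extra that the post does not mention.

```python
def admixture_commands(bed_file, k_min, k_max, cross_validate=False):
    """Build one ADMIXTURE command line per value of K (inclusive range)."""
    commands = []
    for k in range(k_min, k_max + 1):
        cmd = ["admixture"]
        if cross_validate:
            cmd.append("--cv")  # ask ADMIXTURE to report cross-validation error
        cmd += [bed_file, str(k)]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # "aap_thin.bed" is a hypothetical file name, not the author's actual file.
    for cmd in admixture_commands("aap_thin.bed", 2, 14):
        print(" ".join(cmd))
    # To actually run one of these (needs the admixture binary on PATH):
    #   import subprocess; subprocess.run(cmd, check=True)
```

Keeping command construction separate from execution makes it easy to eyeball the sweep before committing hours of compute to it.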

Read More

CATEGORIZED UNDER: Genetics, Genomics

The continuing tangling of the human tree

By Razib Khan | April 27, 2011 3:59 pm

Last summer I made a thoughtless and silly error in relation to a model of human population history when asked by a reader the question: “which population is most distantly related to Africans?” I contended that all non-African populations are equally distant. This is obviously wrong on the face of it if you look at any genetic distance measures. West Eurasians, even those without recent Sub-Saharan African admixture (e.g., North Europeans) are closer than East Eurasians, who are often closer than Oceanians and Amerindians. One explanation I offered is that these latter groups were subject to greater genetic drift through a series of population bottlenecks. In this framework the number of generations until the last common ancestor with Sub-Saharan Africans for all groups outside of Africa should be about the same, but due to evolutionary factors such as more extreme genetic drift or different selective pressures some non-African groups had diverged more from Africans than others in terms of their genetic state. In other words, the most genetically divergent groups in relation to Africans did not diverge any earlier, but simply diverged more rapidly.

Dienekes Pontikos disagreed with such a simple explanation. He argued that admixture or gene flow between Africans and non-African groups since the last common ancestor could explain the differences. I am now of the opinion that Dienekes may have been right. My own confidence in the “serial bottleneck” hypothesis as the primary explanation for the nature of relationships of the phylogenetic tree of human populations is shaky at best. Why my errors of inference?

There were two major issues at work in my misjudgments of the arc of the past and the topology of the present. In the latter instance I saw plenty of phylogenetic trees which illustrated clearly the variation in genetic distance from Africans for various non-African groups. Why didn’t I internalize those visual representations? It was I think the power of the “Out of Africa” (OoA) with replacement paradigm. Even by the summer of 2010 I had come to reject it in its strong form, due to the evidence of admixture with Neanderthals, and rumors of other events which were borne out to be true with the publishing of the Denisovan results. But to a first approximation the clean and simple OoA was still looming so large in my mind that I made the incorrect inference, whereby all non-Africans are viewed simply as a branch of Africans without any particular differentiation in relation to their ancestral population. Secondarily, I also was still impacted by the idea that most of the genetic variation you see in the world around us has its roots tens of thousands of years ago. By this, I mean that the phylogeographic patterns of 25,000 years in the past would map on well to the phylogeographic patterns of the present. This assumption is what drove a lot of phylogeography in the early aughts, because the chain of causation could be reversed, and inferences about the past were made from patterns of the present. My own confidence in this model had already been perturbed when I made my errors, but it still held some sort of sway in my head implicitly I believe. It is one thing to move on from old models explicitly, but another thing to remove the furniture from your cognitive basement and attic.

I have moved further from my preconceptions between then and now. It took a while to sink in, but I’m getting there. A cognitive “paradigm shift” if you will. In particular I am more open to the idea of substantive back migration to Africa, as well as secondary migrations out of Africa. A new paper in Genome Research is out which adds some interesting details to this bigger discussion, and seems to weigh in further against my tentative hypothesis that serial bottlenecks and genetic drift can explain variation in distance to Africans of various non-African groups. Human population dispersal “Out of Africa” estimated from linkage disequilibrium and allele frequencies of SNPs:

Read More

A best case scenario for unsupervised ADMIXTURE?

By Razib Khan | April 7, 2011 2:59 pm

One of the great things about ADMIXTURE is that the population elements shake out of the data through the logic of the program. The worst thing is that it is then left up to you to make sense of the elements. A useful way to use ADMIXTURE and avoid excessive interpretive fogginess is to figure out individual proportions of contribution from X ancestral groups when you have a pretty good idea that an admixture event did occur between very distinct and distantly related population groups. To some extent the whole New World is a good laboratory for this process. Consider, for example, someone from the Dominican Republic or Puerto Rico. There is a good chance that their ancestry will fractionate into three elements:

– An African one

– An Amerindian one

– A European one

These three elements are sampled from very different locations geographically. The ancestral populations have been separated for tens of thousands of years, with little to no gene flow across them. This means that the allele frequencies of the “source” populations should be relatively different (maximizing Fst). A mapping of inferred allele frequencies between abstract ancestral populations generated by ADMIXTURE to concrete allele frequencies of known source populations is rather straightforward.
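To make the point about differing allele frequencies and Fst concrete, here is a minimal sketch of a Wright-style Fst for two populations, computed from per-SNP allele frequencies. The function name and the toy frequencies are illustrative assumptions, and the equal weighting of the two populations is a simplification; this is not the estimator ADMIXTURE itself reports.

```python
def fst_two_populations(freqs_a, freqs_b):
    """Wright-style Fst for two populations from per-SNP allele frequencies.

    Uses the ratio-of-averages form (sum H_T - sum H_S) / sum H_T, with
    equal weight on each population (an assumption of this sketch).
    """
    if len(freqs_a) != len(freqs_b):
        raise ValueError("frequency lists must cover the same SNPs")
    h_total = 0.0   # expected heterozygosity of the pooled population
    h_within = 0.0  # average expected heterozygosity within populations
    for p_a, p_b in zip(freqs_a, freqs_b):
        p_bar = (p_a + p_b) / 2
        h_total += 2 * p_bar * (1 - p_bar)
        h_within += (2 * p_a * (1 - p_a) + 2 * p_b * (1 - p_b)) / 2
    if h_total == 0:
        return 0.0  # no variation at all, so no differentiation to measure
    return (h_total - h_within) / h_total


if __name__ == "__main__":
    # Toy frequencies: strongly diverged sources versus identical ones.
    print(fst_two_populations([0.1], [0.9]))
    print(fst_two_populations([0.3, 0.7], [0.3, 0.7]))
```

The further apart the source frequencies are, the larger Fst gets, and the easier it is to map an abstract inferred component onto a known source population.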

So here’s an experiment. I have 40 individuals with non-trivial African admixture. Most of them are African Americans, though some are of Latino heritage, and several of Ethiopian or Somali origin. A minority are also people who have a small quantum of African ancestry, but well above the “noise” threshold. Let’s take four populations from the HapMap: Yoruba, Utah whites, Maasai, and Chinese from Beijing. I merged the data (removing problem individuals), and added the aforementioned 40 individuals. I pruned the data set so that no more than 0.5% of a given SNP is missing across the individuals. I was left with ~120,000 markers.
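The missingness pruning described above can be sketched as a filter over a genotype matrix. This is an illustration only: the function name and the toy data are made up, and the post does not say which tool was used. PLINK's `--geno 0.005` option applies this kind of per-SNP missing-rate cutoff, but that it was the tool here is an assumption.

```python
def snps_passing_missingness(genotypes, max_missing=0.005):
    """Return column indices of SNPs whose missing-call rate is <= max_missing.

    genotypes: list of per-individual lists; each entry is 0, 1, 2 or None
    (None marks a missing call).
    """
    n_individuals = len(genotypes)
    n_snps = len(genotypes[0]) if genotypes else 0
    kept = []
    for j in range(n_snps):
        missing = sum(1 for person in genotypes if person[j] is None)
        if missing / n_individuals <= max_missing:
            kept.append(j)
    return kept


if __name__ == "__main__":
    # Toy data: 4 individuals x 3 SNPs; SNPs 1 and 2 each have one missing call.
    toy = [
        [0, None, 2],
        [1, 1, 2],
        [2, 0, None],
        [0, 1, 1],
    ]
    print(snps_passing_missingness(toy))
```

With only a few hundred individuals, a 0.5% cutoff tolerates at most one or two missing calls per SNP, which is why a merged data set can shrink so much.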

Then I did two runs of ADMIXTURE: supervised and unsupervised. In the supervised run the HapMap populations were “pure,” while in the unsupervised runs the HapMap populations also had their ancestries inferred. Here are the population breakdowns for the HapMap populations in the unsupervised run:

Read More

CATEGORIZED UNDER: Genetics, Genomics

You learn from failure

By Razib Khan | April 2, 2011 1:32 am

In yesterday’s post on African genetics I tried to work with a large set of populations, but narrowed SNPs down to ~40,000. Today I thought I’d go another route, focus on having a thicker marker set, but with fewer populations. So I did a bunch of runs with 400,000 SNPs. Here’s K = 8. Please note, I did some “trial” runs and pulled out people with obvious admixture which was recent or an outlier within their population (e.g., Mozabites with a lot of Sub-Saharan African ancestry, or San who obviously had European ancestry).

Notice that there are three non-Sub-Saharan modal components. South of the Sahara the European one is absent. But here’s the weird thing. Below are MDS representations of genetic distance between the ancestral groups inferred above:

Read More

CATEGORIZED UNDER: Genetics, Genomics

Another genome blogger….

By Razib Khan | March 26, 2011 7:53 pm

Reader “Diogenes,” with ADMIXTURE in hand, and way more knowledge of archaeology than I can comprehend, now has a blog. Why am I starting a blog…:

I named my blog Artemis since I believe the “Neolithic” which shaped our world for the last 10,000 years is now ending. Demeter’s shackles are broken.

I’m starting my own Project playing with ADMIXTURE and other programs. I’m not a scientist (even though I work in a field related to biology), but I’ll try to substantiate my thoughts whenever possible.

His interest seems to be the Neolithic Revolution/Evolution in Europe.

MORE ABOUT: Admixture, Genomics

Genome bloggers & Indian genomics

By Razib Khan | March 22, 2011 11:32 pm

Dienekes, David, and Zack, have now integrated the insight from Reconstructing Indian History that the programs which infer population structure, such as STRUCTURE, frappe, and ADMIXTURE, can produce ancestral components which are themselves actually stabilized hybrids. In particular the “South Asian” component in many of these analyses may be an ancient admixture between a European-like population, “Ancestral North Indians,” and an indigenous population with marginally greater affinities to East Asians than Europeans, “Ancestral South Indians.”

Here are the posts of interest:

Ancestral North Indian – Ancestral South Indian (ANI/ASI) inferred proportions for South Asian members

Reconstructing the Ancestral North Indian (ANI) genome

Dienekes on ANI/ASI

CATEGORIZED UNDER: Genetics, Genomics

Input determining output in ADMIXTURE

By Razib Khan | March 21, 2011 3:39 pm

One reason I posted about how to run ADMIXTURE was so that more readers could become familiar with the biases of the program themselves. That way they would get cautious about over-reading one particular set of results (the same goes for using PCA to visualize genetic relationships). Dienekes elaborates in detail on this point, A note of caution on admixture estimates:

Much more can be said on this issue, but let’s summarize a couple of lessons:

– The full extent of an admixture cline can be captured only if unadmixed populations on either side of the cline exist. Use as many populations as possible to capture the full extent of an admixture cline.

– Use of an admixed population in lieu of an unadmixed native one inflates the inferred native component. Use native populations if possible instead of admixed ones.

– Even in the absence of unadmixed native populations, it is sometimes possible to reconstruct the admixture proportions as per Reich et al. (2009).

Capturing the complexities of human prehistory from modern populations is tricky. Nonetheless, with increased coverage of human genetic diversity (there are already ~9k individuals in my database), new analytical techniques, and, hopefully some archaeogenetic calibration genotypic, we are bound to learn much more about the distant human past in the not-so distant future.

9,000 individuals in one random guy’s database. Wow. I know Zack has more than 6,000 now. I’ve got around 4,000 myself without much expenditure of effort.

I think it is important to be very cautious about looking at ADMIXTURE results alone and in isolation. The range of possible models one could generate from a set of ADMIXTURE bar plots is enormous. A synthesis of ethnographic, historical, and paleoanthropological information, is necessary to really squeeze further analytic juice out of these powerful new tools.
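The second of the lessons quoted above (an admixed stand-in inflates the inferred native component) can be checked with a little arithmetic. This is a sketch under a strong simplifying assumption: allele frequencies mix linearly and the fit is exact, which is not how ADMIXTURE itself works internally. The function name is hypothetical.

```python
def inferred_native_fraction(true_native, reference_admixture):
    """Native fraction inferred when the 'native' reference is itself admixed.

    true_native: the individual's real native ancestry fraction (0..1).
    reference_admixture: fraction of non-native ancestry hiding in the
    reference population used as a stand-in for the native source (0..1).
    """
    if not 0 <= reference_admixture < 1:
        raise ValueError("reference_admixture must be in [0, 1)")
    # Individual:  p = a*N + (1 - a)*E
    # Reference:   R = (1 - m)*N + m*E
    # Fitting p = x*R + (1 - x)*E gives x*(1 - m) = a, so x = a / (1 - m).
    return min(1.0, true_native / (1 - reference_admixture))


if __name__ == "__main__":
    # A 30% native individual measured against a reference that is 25% admixed.
    print(inferred_native_fraction(0.30, 0.25))
```

Because 1 - m is less than one whenever the reference carries any admixture, the inferred fraction always comes out at or above the true one.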

CATEGORIZED UNDER: Genetics, Genomics

Eurasia, ADMIXTURE supervised & unsupervised

By Razib Khan | March 16, 2011 12:45 am

After yesterday’s post I thought it might be useful to see how running ADMIXTURE in different modes would impact the outcomes. Probably the major reason I wish more people would use this software is that they’d see that this program is just a program, and stop assuming its outputs to be divine writ. Over the years I’ve noticed a tendency of individuals anchoring to one specific plot in one specific paper as if it supported their argument definitively. Running ADMIXTURE or PCA plots via EIGENSOFT makes you very aware of how useless this sort of stance is.

Today I’ve limited the population set to be “South Asia-centric.” Specifically, there are only a few Middle Eastern, European, and East Asian populations, along with one African population. The goal is to figure out how different South Asian groups relate to these non-South Asian groups. First, I ran ADMIXTURE K = 2 to K = 9. Then, I ran ADMIXTURE in “supervised” mode for K = 9. Basically, I set nine populations as “pure” references. They were:

– Tamil Dalit
– French Basque
– Lithuanian
– Adygei
– Palestinian
– Buryat (Altaic region)
– Dai (southern China)
– Papuan
– Luhya (Kenya)
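For readers who want to try the supervised mode themselves: ADMIXTURE's `--supervised` flag expects a `.pop` file alongside the input, with one line per individual in the same order as the `.fam` file. Reference individuals carry their population label and everyone else gets `-`. A minimal sketch of generating those lines follows. The function name is hypothetical, and the non-reference labels ("Gujarati", "Sindhi") are made up for illustration.

```python
def pop_file_lines(sample_populations, reference_populations):
    """Return one .pop line per sample, in the same order as the .fam file.

    Reference samples get their population label; everyone else gets "-",
    which supervised mode treats as ancestry to be estimated.
    """
    references = set(reference_populations)
    return [pop if pop in references else "-" for pop in sample_populations]


if __name__ == "__main__":
    # Labels use underscores to avoid whitespace inside a .pop line.
    samples = ["Tamil_Dalit", "Gujarati", "Lithuanian", "Papuan", "Sindhi"]
    refs = ["Tamil_Dalit", "Lithuanian", "Papuan"]
    print("\n".join(pop_file_lines(samples, refs)))
    # The run itself would then look like: admixture --supervised data.bed 9
    # ("data.bed" is a placeholder name; K must equal the number of references).
```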

Read More

CATEGORIZED UNDER: Genetics, Genomics, Uncategorized

Analyzing ancestry with ADMIXTURE, step by step

By Razib Khan | March 14, 2011 3:55 pm

Over the past few months I was hoping more people would start doing what Zack Ajmal, Dienekes, and David have been doing. There are public data sets, and open source software, so that anyone with a nerdy inclination can explore their own questions out of curiosity. That way you can see the power and the limitations of genomics on your own desktop. I wonder if one of the biggest reasons that more people haven’t started doing this is formatting. It can be a pain to convert matrix formatted files into pedigree format, for example. But the data gusher isn’t ending, look at what’s coming out (and has come out) in the 1000 Genomes project!

I’ve been thinking I need to write up a post which is a “soft landing” for people so that we can reduce the “activation energy” for this sort of thing…once you get hooked, you only go deeper. Luckily an anonymous tipster has sent me the link to a URL with a huge data set which has been merged, already pedigree formatted. Here are the populations:

Read More

CATEGORIZED UNDER: Genetics, Genomics, Personal Genomics

Tea leaves and population substructure

By Razib Khan | February 22, 2011 1:26 am

Image credit: Wikimol

Over the past few months I’ve been encouraging people to pull down ADMIXTURE, and push the public data sets through it. Additionally, you can also convert your  23andMe raw file into pedigree format pretty easily and integrate it into the public data sets with PLINK. I’ve been following Zack’s Harappa Ancestry Project pretty closely, but I’ve been running the software myself and manipulating its parameters and seeing how things shake out. But the more and more I do it, the more I wonder if it isn’t like regression analysis, a technique which is just waiting to be leveraged by human biases. I began thinking of this more deeply after a conversation with a computational biologist who outlined the structural problems with how ad hoc the utilization of statistics is in the life sciences.

These sorts of qualms are probably why I’m posting my results more on Facebook and passing them around friends, rather than putting them out there in the public domain. It isn’t that I think the results are going to be abused. I just don’t know what they mean a lot of the time. Or, perhaps more honestly I am suspicious of my own propensity to see what I suspect. A case of my priors strongly shaping the inferences which I might generate.

So I decided to do an experiment. Below are 8 runs, displayed as bar plots. Each thin sliver represents an individual. The colors again represent putative ancestral populations of which the modern populations are combinations, generated by the parameter K (so K = 2 means two ancestral populations, each corresponding to a different color). There are two data sets which I analyzed, group A and group B. I’ve also noted the K’s for each plot. But aside from that, I’ll leave you ignorant what these populations are or how many there are. Jot down some ideas as to what you can see. How many populations? How do they relate to each other? Can you perceive any real information in the higher K’s? I’ll put the “answers” below the fold. There’s no point in me saying what I think, I already know which populations these are, so I’m tainted.


Read More

CATEGORIZED UNDER: Genetics, Genomics

Eurasia + Mozabites + Papuans

By Razib Khan | February 16, 2011 1:20 pm

I’m in a hurry right now, and won’t be posting much this week. But, I thought I’d dump some of the ADMIXTURE runs I have. This is one with 80,000 markers, and Eurasian populations, Papuans and Mozabites. I removed the New World and Africa to constrain the variance space. This time I’ve labelled the ancestral components, but do not take them totally literally. I think in the future I might just remove the Kalash to see what happens. This is K = 7. Not too busy, but I think enough K’s to separate out the various West Eurasian groups. Additionally I’ve put the genetic distances, Fst, below, and, visualized them on an MDS. Nothing too surprising.

Fst              Northeast Asian  South Asian  European  West Asian  Kalash  Southeast Asian  Papuan
Northeast Asian  0                0.1          0.13      0.142       0.137   0.052            0.225
South Asian      0.1              0            0.058     0.07        0.066   0.097            0.201
European         0.13             0.058        0         0.056       0.075   0.132            0.236
West Asian       0.142            0.07         0.056     0           0.098   0.139            0.238
Kalash           0.137            0.066        0.075     0.098       0       0.136            0.243
Southeast Asian  0.052            0.097        0.132     0.139       0.136   0                0.214
Papuan           0.225            0.201        0.236     0.238       0.243   0.214            0
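The MDS view mentioned above can be sketched directly from the Fst table. What follows is a minimal pure-Python classical MDS (double-centering plus power iteration), treating each Fst value as a distance. That choice, and the implementation itself, are assumptions of this sketch; the post does not say which MDS routine or settings produced the original plot.

```python
import math

LABELS = ["Northeast Asian", "South Asian", "European", "West Asian",
          "Kalash", "Southeast Asian", "Papuan"]
FST = [
    [0, 0.1, 0.13, 0.142, 0.137, 0.052, 0.225],
    [0.1, 0, 0.058, 0.07, 0.066, 0.097, 0.201],
    [0.13, 0.058, 0, 0.056, 0.075, 0.132, 0.236],
    [0.142, 0.07, 0.056, 0, 0.098, 0.139, 0.238],
    [0.137, 0.066, 0.075, 0.098, 0, 0.136, 0.243],
    [0.052, 0.097, 0.132, 0.139, 0.136, 0, 0.214],
    [0.225, 0.201, 0.236, 0.238, 0.243, 0.214, 0],
]


def classical_mds(dist, dims=2, iters=500):
    """Embed a symmetric distance matrix in `dims` dimensions (classical MDS)."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double centering: B = -1/2 * J * D^2 * J
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for axis in range(dims):
        v = [1.0 / (i + 1) for i in range(n)]  # arbitrary start vector
        for _ in range(iters):  # power iteration for the dominant eigenvector
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
        if lam <= 0:
            break  # no further real-valued axis to extract
        for i in range(n):
            coords[i][axis] = v[i] * math.sqrt(lam)
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords


if __name__ == "__main__":
    for label, (x, y) in zip(LABELS, classical_mds(FST)):
        print(f"{label:16s} {x:8.4f} {y:8.4f}")
```

The signs of the axes are arbitrary, so only relative positions on the resulting plot carry meaning.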

Read More

CATEGORIZED UNDER: Genetics, Genomics
MORE ABOUT: Admixture, Genomics

D.I.Y. population structure inference, part 1 of many

By Razib Khan | February 13, 2011 3:37 pm

If you’ve been reading this weblog for a while you’ve seen many images like the one above. It comes from the 2008 paper Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation. The data set is from the Human Genome Diversity Project. It consists of 52 groups from around the world, curated for representativeness, but also ethnic distinctiveness. They utilized the FRAPPE program, which like STRUCTURE and ADMIXTURE estimates the ancestry of individuals (and in the aggregate populations) from a combination of components, the number of which you specify with the parameter K. In other words, this is model based. It works out really well when you have an intuition of the model you’re looking for. Imagine African Americans, who you can presume are a two-way admixture between two distinct ancestral populations. It works less well in other cases. For example, South Asians are modeled by 23andMe as a two-way admixture between Europeans and East Asians. Why this occurs is totally comprehensible; they have three (Chinese + Japanese = one) reference populations which are very different from South Asians. So the computer, being dumb but fast, simply slaps together the best inference possible from the weird constraints placed upon it. Garbage in, garbage out.
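The “model” in “model based” can be written compactly. As a sketch, in notation borrowed from the ADMIXTURE literature rather than from this post: individual i has ancestry fractions q_ik over the K components, component k has allele frequency f_kj at SNP j, and g_ij counts copies (0, 1 or 2) of the reference allele. Programs in this family fit Q and F to a likelihood of roughly this form:

```latex
g_{ij} \sim \mathrm{Binomial}\!\left(2,\ \sum_{k=1}^{K} q_{ik} f_{kj}\right),
\qquad q_{ik} \ge 0, \quad \sum_{k=1}^{K} q_{ik} = 1

\ell(Q, F) = \sum_{i} \sum_{j} \left[
    g_{ij} \ln\!\Big(\sum_{k} q_{ik} f_{kj}\Big)
    + (2 - g_{ij}) \ln\!\Big(1 - \sum_{k} q_{ik} f_{kj}\Big)
\right]
```

Nothing in the likelihood knows what the K components “are.” If the data contain only reference populations that are a poor match for an individual, the best available fit is still a mixture of them, which is the garbage-in, garbage-out case described above.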

But along with PCA these sorts of algorithms which allow one to visualize variance across hundreds of thousands of markers across hundreds of individuals are very useful (though perhaps there are mentats amongst us who have no need of such techniques). You just have to use them with caution. Information may be free, but it can be misinterpreted!

Over the years of my blogging people have regularly asked questions of the form “are East Africans more closely related to West or South Africans?” They are easily answered, I would just look in the literature. But, it did take time, and I’d have to pick the right figure, look for Fst, and so forth. But that is changing.

The Nature piece “The rise of the genome bloggers” covered the change. Since last fall BGA and Dodecad have been dumping lots of bar plots and PCAs on the web. Instead of looking for a paper, I have now begun to use those sites as my resource of first choice (since they’re well indexed by Google). Now with HAP you have another source of information. It’s gotten to the point that technically capable commenters are now submitting their own results!

We’ve come a long way. Academics are not miserly with information, and some of my best friends have been the gatekeepers of the data and results. But now you can find the data on the web easily. You can reprocess the data by yourself. And, you can do the analysis yourself.

I’ve been sitting back for a while, letting Dienekes, Zack, etc., do their thing. There are so many technically fluent people out there, I’ve enjoyed just consuming the raw information yield. But that ends today. Over the past week I’ve been slapping together some R functions to make it easier for me to generate bar plots at various K’s, as well as PCA’s. My goal is this: a reader asks a question, and I quickly constrain my data set appropriately and do the analyses, take the screenshots, and upload them to the servers here, and point them to the images in the comments. The main constraint should be the computational resources (ADMIXTURE can take hours). Yes, that’s where we’re at.

Every now and then I’m going to put up a post of ADMIXTURE bar plots or MDS/PCAs. First, it will be useful for my later reference. Second, I think the slide show display view is probably pretty useful to get a gestalt sense of what’s going on. That’s what we’re going for: human comprehension. Below is my first slide show, from K = 2 to K = 16. That is, the models assumed two to sixteen ancestral populations. I also excluded Sub-Saharan Africans from the data set since they’re so varied. Here are the details:

Read More

CATEGORIZED UNDER: Data Analysis, Genetics, Genomics

Counting beans the proper way

By Razib Khan | February 10, 2011 9:46 am

Apropos of several of my recent posts, The New York Times has an interesting article up, Counting by Race Can Throw Off Some Numbers. Basically it outlines the difficulty of enumerating different racial and ethnic groups for different purposes in a more diverse and racially mixed USA. Numbers matter when it comes to apportioning resources, and the current methods are often quite coarse (though some interest groups prefer it that way, because it bolsters their numbers). Let’s focus on the point germane to this weblog:

The National Center for Health Statistics collects vital statistics from the states to document the health of the population. When it comes to collecting birth certificate information, though, the center encounters a problem: 38 states and the District of Columbia report race data in the new and more expansive manner that allows for the recording of more than one race. But a dozen states do not, because they still use old data systems and outdated forms. As a result, the center cannot produce consistent national data for what it calls “medical and health purposes only.”

To get around that problem, the center reclassifies mixed-race births using a complex algorithm. For example, a birth to a parent who marked white, Asian and Native American would be declared just one of those races, depending on a number of variables in a probability model, like sex, age of the mother and place of birth. (Birth data is reported, in most cases, by the race of the mother.)

The medical part is disturbing to me, because I just realized I’ve been part of the problem. You see, the article doesn’t acknowledge that the category “Asian” is genetically incoherent! A friend stated what I was thinking as a good solution: everyone gets a genomic admixture analysis done, and that’s what gets entered into the medical databases. So a white Hispanic with “pure Spanish ancestry” will be counted as white for medical purposes, but counted as Hispanic for the purposes of identity politics. And black Americans who are more than 50% European in ancestry, such as Henry Louis Gates Jr., will be appropriately “weighted” when it comes to medical genetics focusing on African Americans.
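One possible reading of “weighted” is that each person contributes fractionally to every ancestral component instead of being assigned to a single category. The sketch below illustrates only that arithmetic; the function name `effective_counts`, the component labels, and the example fractions are all hypothetical and do not describe how any health agency actually tabulates data.

```python
from collections import defaultdict
from typing import Dict, Iterable


def effective_counts(profiles: Iterable[Dict[str, float]]) -> Dict[str, float]:
    """Sum ancestry fractions per component across individuals.

    Each profile maps a component label to that person's ancestry
    fraction; the fractions for one person must sum to 1.
    """
    totals: Dict[str, float] = defaultdict(float)
    for profile in profiles:
        if abs(sum(profile.values()) - 1.0) > 1e-6:
            raise ValueError("ancestry fractions must sum to 1")
        for component, fraction in profile.items():
            totals[component] += fraction  # fractional, not all-or-nothing
    return dict(totals)


if __name__ == "__main__":
    # Illustrative values only, not real individuals.
    people = [
        {"West African": 0.75, "European": 0.25},
        {"West African": 0.40, "European": 0.60},
        {"European": 1.00},
    ]
    print(effective_counts(people))
```

Under this scheme the component totals always add up to the number of people counted, so nobody is dropped or double-counted, and a person with mostly European ancestry contributes mostly to the European total regardless of the category they identify with.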

CATEGORIZED UNDER: Genetics, Genomics, Health
MORE ABOUT: Admixture, Analysis, Health

Neandertal admixture, revisiting results after shaken priors

By Razib Khan | January 26, 2011 9:14 am

After 2010’s world-shaking revolutions in our understanding of modern human origins, the admixture of Eurasian hominins with neo-Africans, I assumed there was going to be a revisionist look at results which seemed to point to mixing between different human lineages over the past decade. Dienekes links to a case in point, a new paper in Molecular Biology and Evolution, An X-linked haplotype of Neandertal origin is present among all non-African populations. The authors revisit a genetic locus where there have been earlier suggestions of hominin admixture dating back 15 years. In particular, they focus on an intronic segment spanning exon 44 of the dystrophin gene, termed dys44. Of the haplotypes in this segment, they suggested that one, B006, introgressed from a different genetic background than that of neo-Africans. The map of B006 shows the distribution of the putative “archaic” haplotype, taken from a 2003 paper cited in the current one. As you can see there’s a pattern of non-African preponderance of this haplotype. So what’s dystrophin’s deal? From Wikipedia:

Dystrophin is a rod-shaped cytoplasmic protein, and a vital part of a protein complex that connects the cytoskeleton of a muscle fiber to the surrounding extracellular matrix through the cell membrane. This complex is variously known as the costamere or the dystrophin-associated protein complex. Many muscle proteins, such as α-dystrobrevin, syncoilin, synemin, sarcoglycan, dystroglycan, and sarcospan, colocalize with dystrophin at the costamere.

Dystrophin is the longest gene known on DNA level, covering 2.4 megabases (0.08% of the human genome) at locus Xp21. However, it does not encode the longest protein known in humans. The primary transcript measures about 2,400 kilobases and takes 16 hours to transcribe; the mature mRNA measures 14.0 kilobases….

Dystrophin deficiency has been definitively established as one of the root causes of the general class of myopathies collectively referred to as muscular dystrophy. The large cytosolic protein was first identified in 1987 by Louis M. Kunkel…after the 1986 discovery of the mutated gene that causes Duchenne muscular dystrophy (DMD) ….

OK, so we’ve established that this is not an obscure gene. Here’s the abstract of the new paper:

Read More

ADMIXTURE vs. MDS, visualization is just visualization

By Razib Khan | January 18, 2011 1:06 pm

Dienekes did another run of his data with K = 64. He posted a huge plot with the two largest dimensions of variation. He also posted an accompanying spreadsheet with the coordinates of where the Dodecad samples were. So I found my own position pretty quickly. Before going to that, I thought I’d repost a comparison between myself, the HapMap Gujaratis, the North Kannadi sample, and the HGDP Uygurs. This is at K = 10 in ADMIXTURE from Dodecad.

OK, with that in mind, here’s the full MDS with the two largest components of genetic variation. I’ve added large labels. Also, click the image for a larger file so you can read the small labels.

Read More

CATEGORIZED UNDER: Genetics, Genomics

The genetic affinities of Ethiopians

By Razib Khan | January 10, 2011 2:04 pm

In the open thread someone asked: “Any recent stuff on the genetics of Ethiopians.” That prompted me to look around, because I’m curious too. Poking around Wikipedia I couldn’t find anything recent. A lot of the studies are older uniparental lineage based works (NRY and mtDNA). Ethiopia is interesting because unlike almost all other Sub-Saharan African nations it has a long written history. Culturally and linguistically it has both Sub-Saharan African, and non-Sub-Saharan African, affinities. The languages of highland Ethiopia are clearly Semitic. Those of lowland Ethiopia are Cushitic, a branch of the broader Afro-Asiatic language family concentrated around the Horn of Africa (Somali is a Cushitic language, though most Ethiopian nationals who speak a Cushitic dialect are of the Oromo group).

From a human evolutionary genetic perspective, Ethiopia also has specific interest. It is likely that the main recent pulse of humans Out of Africa traversed this region. Additionally, there is some evidence of deep time connections between the groups ancestral to Ethiopians and the Khoisan of southern Africa. It may be that Ethiopians and Khoisan are reservoirs of ancient genetic variation in Sub-Saharan Africa which has been overlain by Bantu in most other regions outside of West Africa. Finally, Ethiopians are known to have high altitude adaptations. This could be due to long term residence in the region, or to assimilation of favorable alleles from the long term residents by later populations.

Fortunately we can get a sense of the genetic affinities of Ethiopians thanks to a paper published last spring, The genome-wide structure of the Jewish people. The focus was clearly on Jews, but they surveyed Amhara & Tigray (Semitic speaking highlanders), Ethiopian Jews (similar ethnically to the Amhara & Tigray, but religiously non-Christian), and Oromo. In the PCA the Oromo and Semitic speaking populations are pretty obviously distinct clusters.

Read More

CATEGORIZED UNDER: Genetics, Genomics

Gene Expression

This blog is about evolution, genetics, genomics and their interstices. Please beware that comments are aggressively moderated. Uncivil or churlish comments will likely get you banned immediately, so make any contribution count!
