International HapMap Project

About the International HapMap Project

The goal of the International HapMap Project is to develop a haplotype map of the human genome, the HapMap, which will describe the common patterns of human DNA sequence variation. The HapMap is expected to be a key resource for researchers to use to find genes affecting health, disease, and responses to drugs and environmental factors. The information produced by the Project will be made freely available.

The Project is a collaboration among scientists in Japan, the U.K., Canada, China, Nigeria, and the U.S. [See Participating Groups and Initial Planning Groups.] The Project officially started with a meeting on October 27-29, 2002 (http://genome.gov/10005336), and is expected to take about three years.

Genetic variation and use of the HapMap

Most common diseases, such as diabetes, cancer, stroke, heart disease, depression, and asthma, are affected by many genes and environmental factors. Although any two unrelated people are the same at about 99.9% of their DNA sequences, the remaining 0.1% is important because it contains the genetic variants that influence how people differ in their risk of disease or their response to drugs. Discovering the DNA sequence variants that contribute to common disease risk offers one of the best opportunities for understanding the complex causes of disease in humans.

Sites in the genome where the DNA sequences of many individuals differ by a single base are called single nucleotide polymorphisms (SNPs). For example, some people may have a chromosome with an A at a particular site where others have a chromosome with a G. Each form is called an allele.

A part of two chromosomes showing a SNP. Both the A and G alleles are shown.

Each person has two copies of all chromosomes except the sex chromosomes. The set of alleles that a person has is called a genotype. For this SNP a person could have the genotype AA, AG, or GG. (See http://www.dnaftb.org/dnaftb/ for basic genetics information.) The term genotype can refer to the SNP alleles that a person has at a particular SNP, or at many SNPs across the genome. A method that discovers what genotype a person has is called genotyping.

About 10 million SNPs exist in human populations, where the rarer SNP allele has a frequency of at least 1%. Alleles of SNPs that are close together tend to be inherited together. A set of associated SNP alleles in a region of a chromosome is called a "haplotype". Most chromosome regions have only a few common haplotypes (each with a frequency of at least 5%), which account for most of the variation from person to person in a population. A chromosome region may contain many SNPs, but only a few "tag" SNPs can provide most of the information on the pattern of genetic variation in the region.

A chromosome region with only the SNPs shown. Three haplotypes are shown. The two SNPs in color are sufficient to identify (tag) each of the three haplotypes. For example, if a chromosome has alleles A and T at these two tag SNPs, then it has the first haplotype.
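The idea of tagging can be sketched in a few lines of Python. This is only an illustration: the three haplotype strings are hypothetical (they are not the ones in the figure), the function name is invented for this example, and the brute-force search is a simple stand-in for whatever method the Project actually uses to choose tag SNPs.

```python
from itertools import combinations


def find_tag_snps(haplotypes):
    """Return the smallest set of SNP positions that tells all haplotypes apart.

    Brute-force search, suitable only for a handful of SNPs.
    """
    n_sites = len(haplotypes[0])
    for size in range(1, n_sites + 1):
        for positions in combinations(range(n_sites), size):
            # Alleles each haplotype carries at the candidate tag positions.
            signatures = {tuple(h[p] for p in positions) for h in haplotypes}
            if len(signatures) == len(haplotypes):
                return positions
    # Only reached if two haplotypes are identical at every site.
    return tuple(range(n_sites))


# Hypothetical haplotypes over four SNPs in one chromosome region.
haplotypes = ["ACTG", "ACCA", "GTCA"]
tags = find_tag_snps(haplotypes)
print("tag SNP positions:", tags)
for h in haplotypes:
    print(h, "->", "".join(h[p] for p in tags))
```

With these made-up haplotypes, no single SNP separates all three, but the first and third SNPs together do; a chromosome carrying A and T at those two positions must have the first haplotype.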

The HapMap will describe the common patterns of genetic variation in humans. It will include the chromosome regions with sets of strongly associated SNPs, the haplotypes in those regions, and the SNPs that tag them. It will also note the chromosome regions where associations among SNPs are weak.

Researchers trying to discover the genes that affect a disease, such as diabetes, will compare a group of people with the disease to a group of people without the disease. Chromosome regions where the two groups differ in their haplotype frequencies might contain genes affecting the disease. Theoretically, researchers could look for these regions by genotyping 10 million SNPs. However, the methods to do this are currently too expensive. The HapMap will identify which 200,000 to 1 million tag SNPs provide almost as much mapping information as the 10 million SNPs. This substantial cost reduction will make such studies feasible.
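As a rough sketch of such a comparison, the Python below tests whether one haplotype is more common among chromosomes from people with a disease than among chromosomes from people without it. The counts are invented for illustration, and the Pearson chi-square test on a 2x2 table is a common textbook choice assumed here, not a method prescribed by the Project.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 degree of freedom) for the table

                     with haplotype   without haplotype
        cases              a                 b
        controls           c                 d
    """
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Hypothetical chromosome counts: 200 from cases, 200 from controls.
cases_with, cases_without = 60, 140
controls_with, controls_without = 40, 160

case_freq = cases_with / (cases_with + cases_without)
control_freq = controls_with / (controls_with + controls_without)
statistic = chi_square_2x2(cases_with, cases_without,
                           controls_with, controls_without)

print(f"haplotype frequency in cases:    {case_freq:.2f}")
print(f"haplotype frequency in controls: {control_freq:.2f}")
print(f"chi-square statistic:            {statistic:.2f}")
```

A statistic above about 3.84 corresponds to p < 0.05 for a single test with one degree of freedom. A real study tests many regions and would need a much stricter threshold.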

Populations and samples

Most of the common haplotypes occur in all human populations; however, their frequencies differ among populations. Therefore, data from several populations are needed to choose tag SNPs. Pilot studies have found sufficient differences in haplotype frequencies among population samples from Nigeria (Yoruba), Japan, China and the U.S. (residents with ancestry from Northern and Western Europe, collected in 1980 by the Centre d'Etude du Polymorphisme Humain (CEPH) and used for other human genetic maps) to warrant developing the HapMap with large-scale analysis of haplotypes in these populations. The HapMap developed with information from these populations should be useful for all populations in the world. However, to assess how much more information would be gained by including other populations, a parallel study will examine haplotypes in a set of chromosome regions in samples from several additional populations.

The DNA samples for the HapMap will come from a total of 270 people: from the Yoruba people in Ibadan, Nigeria (30 both-parent-and-adult-child trios), Japanese in Tokyo (45 unrelated individuals), Han Chinese in Beijing (45 unrelated individuals), and the CEPH (30 trios). These numbers of samples will allow the Project to find almost all haplotypes with frequencies of 5% or higher. All of the new samples collected for the Project are being obtained with protocols approved by the appropriate ethics committees, after culturally appropriate processes of community engagement or public consultation and individual informed consent. The community engagement process is designed to identify and attempt to respond to culturally specific concerns and give the participating communities direct input into the informed consent and sample collection processes.
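The sample numbers can be checked with a short calculation. The sketch below rests on assumptions not stated above: that only the parents in a trio contribute independent chromosomes (four per trio), that each unrelated individual contributes two, and that chromosomes are sampled independently. Under those assumptions it estimates the chance that a haplotype with a frequency of 5% shows up at least once in a population sample.

```python
def count_people(trios, unrelated):
    """Each trio is three people: two parents and one adult child."""
    return 3 * trios + unrelated


def independent_chromosomes(trios, unrelated):
    """Assumption: a child's chromosomes are copies of the parents',
    so a trio adds 4 independent chromosomes and an unrelated person adds 2."""
    return 4 * trios + 2 * unrelated


def prob_seen_at_least_once(freq, n_chromosomes):
    """Chance that a haplotype of the given frequency appears at least once
    among n independently sampled chromosomes."""
    return 1 - (1 - freq) ** n_chromosomes


# (trios, unrelated individuals) for each population sample named in the text.
samples = {
    "Yoruba (Ibadan)": (30, 0),
    "Japanese (Tokyo)": (0, 45),
    "Han Chinese (Beijing)": (0, 45),
    "CEPH": (30, 0),
}

total_people = sum(count_people(t, u) for t, u in samples.values())
print("total people:", total_people)

for name, (t, u) in samples.items():
    n = independent_chromosomes(t, u)
    p = prob_seen_at_least_once(0.05, n)
    print(f"{name}: {n} chromosomes, P(see 5% haplotype) = {p:.3f}")
```

Under these assumptions, each population sample has roughly a 99% or better chance of containing any given haplotype with a frequency of 5%, which is consistent with the statement that almost all such haplotypes will be found.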

The CEPH samples are available from the non-profit Coriell Institute for Medical Research (http://locus.umdnj.edu/nigms/). DNA and cell lines from the other blood samples will be available from Coriell in 2004, for future studies approved by the appropriate ethics committees. No medical or individually identifying information will be linked to the samples; they will be labeled only by sex and population. An advisory group will be set up for each community where new samples are being collected, to serve as a liaison with Coriell and make sure that future uses of the samples are consistent with the terms of the consent form.

Ethical issues

The Project raises several ethical issues. Since the samples include no personal identifiers, the privacy risks to individual donors are minimal. However, each sample will be labeled by population, to allow researchers to choose tag SNPs that are most useful for each future study population. The tag SNPs will be chosen based on the haplotype frequencies. The tag SNPs for some regions might differ among populations if the haplotype frequencies in those regions were considerably different among populations. Thus the SNP and haplotype frequencies for each population will be calculated, allowing comparisons. This could raise risks of group stigmatization or discrimination, if a higher frequency of a disease-associated variant were found in a population and the risks associated with that variant were overgeneralized to all or most of the members of the population. Another potential concern is that the inclusion of populations based on ancestral geography could result in categories such as "race", which are largely socially constructed, being incorrectly viewed as precise and highly meaningful biological constructs. The Project undertook the community consultations to understand community concerns about such issues.

Scientific strategy

To develop the HapMap, the samples will be genotyped for at least 1 million SNPs across the human genome. When the Project started, 2.8 million SNPs were in the public database dbSNP. However, many chromosome regions had too few SNPs, and many SNPs were too rare to be useful, so millions of additional SNPs were needed to develop the HapMap. The Project discovered another 2.8 million SNPs by September of 2003, and SNP discovery continues.

The genotyping will be carried out by ten centers in Canada, China, Japan, the United Kingdom, and the United States. Each center will genotype all the samples for its assigned chromosomes. The centers are using five genotyping technologies. The Project initially (by about June of 2004) will produce a map of 600,000 SNPs evenly spaced across the genome, which is a density of one SNP every 5000 bases. Additional SNPs will be genotyped where needed to define haplotypes. Genotyping quality will be assessed by using duplicate samples, having all centers genotype a standard set of SNPs, and having centers check some of the genotypes produced by other centers.
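The stated density follows from simple arithmetic, sketched below. The genome length of about 3 billion bases is a round figure assumed for this example; it is not given in the text above.

```python
# Assumption: approximate length of the human genome, in bases.
GENOME_LENGTH_BASES = 3_000_000_000


def average_spacing(n_snps, genome_length=GENOME_LENGTH_BASES):
    """Average number of bases per SNP if SNPs are evenly spaced."""
    return genome_length / n_snps


print(average_spacing(600_000))     # initial map of 600,000 SNPs
print(average_spacing(10_000_000))  # all ~10 million common SNPs
```

Under this assumption the initial map works out to one SNP every 5000 bases, compared with roughly one every 300 bases if all 10 million common SNPs were genotyped.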

Data analysis

The basic data set produced by the Project will be the genotypes of the 270 individual samples and the frequencies of SNP alleles and genotypes in each population. To define the haplotypes and choose the tag SNPs, the Project will use standard measures of SNP association, such as D' and r², and will develop new analysis methods. Because the Project's data will be freely available, other researchers also will be able to analyze the data, as well as improve the methods of analysis. The data generated by the Project will show the common patterns of genetic variation across the human genome, including the amount of variation among individuals, the regions that vary in haplotype frequencies among populations, and the extent of associations among SNPs in different chromosome regions.
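For readers unfamiliar with these measures, the following Python sketch computes D' and r² for a pair of two-allele SNPs from haplotype counts, using the standard textbook definitions. The function name and the counts are hypothetical, and the sketch assumes the haplotypes are already known (phased); it does not represent the Project's analysis software.

```python
def ld_measures(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, D', r^2) for two biallelic SNPs from haplotype counts.

    A/a are the alleles at the first SNP and B/b at the second.
    """
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total   # frequency of allele A
    p_B = (n_AB + n_aB) / total   # frequency of allele B

    if p_A in (0.0, 1.0) or p_B in (0.0, 1.0):
        raise ValueError("both SNPs must be polymorphic in the sample")

    # D: how far the AB haplotype frequency departs from independence.
    d = p_AB - p_A * p_B

    # D' rescales |D| by the largest value it could take
    # given the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max

    # r^2 is the squared correlation between the alleles at the two SNPs.
    r_squared = d ** 2 / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return d, d_prime, r_squared


# Hypothetical counts from 100 chromosomes.
print(ld_measures(50, 0, 0, 50))    # alleles always travel together
print(ld_measures(40, 10, 10, 40))  # partial association
print(ld_measures(25, 25, 25, 25))  # no association
```

Both measures range from 0 (no association) to 1. When r² is 1, one SNP perfectly predicts the other, so either can serve as a tag for both.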

Data Access and Intellectual Property Policies

The Project will release all the data it produces into the public domain, so that any researcher can use the information. The new SNPs, assays for genotyping SNPs, and frequencies of SNP alleles, genotypes, and haplotypes will be released publicly soon after they are produced. When SNPs have been genotyped densely enough to define regions of strong association, then the haplotypes, individual genotypes, and tag SNPs in those regions will be released publicly without restrictions. However, before regions accumulate this density of data, the individual genotype data will be made available under a data access policy that imposes only minimal constraints. Users must agree to not reduce others' access to the data, and to share the data only with others who have made the same agreement. The sole purpose of this temporary mechanism for data access is to make sure that Project data remain in the public domain. At the end of the Project, any data that have not yet been released will be made public.

The Project does not include studies to relate genetic variation to phenotypes such as a disease risk or drug response, i.e., a "specific utility". The participants in the Project do not believe that SNP, genotype, or haplotype data for which a specific utility has not been generated are appropriately patentable inventions. The data access policy does not prevent users from applying for patents on SNPs or haplotypes for which they have demonstrated a specific utility, as long as they do not prevent others from obtaining access to Project data. Project participants will not use Project data for other projects in their laboratories before the data are released.

Internal Data Access Policy

Participants in the International HapMap Project will not use Project data (even data that they have produced) for other projects in their laboratories before the data are publicly released, either to dbSNP (in the case of SNPs, SNP assays, and allele and genotype frequencies) or through the DCC Genotype Database (in the case of individual genotypes and haplotypes).

Participants in the International HapMap Project will obtain access to the Project data in the same way as all other users. For the genotypes and haplotype data this will be under the terms and conditions of the public access license agreement. All Project participants have affirmed their own acceptance of the conditions of the license agreement under the same terms as other users of the data.

In the absence of identified utility/function (i.e. association with a phenotype), Project participants will not apply for patents on SNPs or haplotypes that are produced for the Project. Project participants may apply for patents that relate SNPs or haplotypes to disease or function, if they have demonstrated functional evidence or other identified utility. However, as the HapMap Project does not include studies that would generate information about function or utility, such results could only be obtained in additional, non-HapMap Project studies. To use HapMap Project data in such additional studies, Project participants may only use the data that are publicly released in dbSNP or the DCC Genotype Database. If patents are applied for and issued, they will not be enforced in a way that will inhibit the access of others to the HapMap Project data.

Last updated: 2006/10/26

Please send questions and comments on the website to hapmap-help@ncbi.nlm.nih.gov