AWS genome data

Amazon Web Services (AWS) announced today that it has placed 200 terabytes of genomic data from 1,700 individuals in its public cloud, where anyone in the world can access it for free. The data is part of the 1000 Genomes Project, which is sponsored by the National Institutes of Health along with about 75 partner companies and organizations. Researchers on the project are looking for genetic variants that occur at frequencies greater than 1% across the sample set.
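To make the 1% threshold concrete, here is a minimal Python sketch of how a variant's frequency might be computed and checked against that cutoff. The function names, the data layout (one pair of alleles per individual), and the sample site are all assumptions made for illustration; they are not taken from the 1000 Genomes Project's own tooling.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Return {allele: frequency} for one variant site.

    `genotypes` is a list of per-individual allele pairs, e.g. ("A", "G").
    This layout is an assumption made for the example.
    """
    counts = Counter(allele for pair in genotypes for allele in pair)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}

def is_common_variant(genotypes, reference, threshold=0.01):
    """True if any non-reference allele is more frequent than the threshold."""
    freqs = allele_frequencies(genotypes)
    return any(f > threshold for a, f in freqs.items() if a != reference)

# Hypothetical site: 100 individuals, 3 of whom carry one copy of "G".
site = [("A", "A")] * 97 + [("A", "G")] * 3
print(allele_frequencies(site)["G"])           # 3 of 200 alleles
print(is_common_variant(site, reference="A"))  # compared against the 1% cutoff
```

In this made-up site, "G" accounts for 3 of 200 alleles, so it sits above the 1% cutoff; a variant seen once among 200 individuals would not.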

The researchers say that cataloging variants at the 1% frequency level will help in the study of disease. The overall goal is for AWS to store genomic information from 2,662 individuals from around the world to advance scientific research. This would be the largest collection of human genetic data in the world, and it is being stored on AWS's own servers. According to the press release, AWS is hosting the data for free but charges users for the compute power required to analyze it.

AWS says users can, for example, run Hadoop on its Elastic Compute Cloud (EC2) or Elastic MapReduce services to analyze the data stored in its Simple Storage Service (S3). The 1000 Genomes Project has set high ethical standards, which explains why most of the 1,700 genomic datasets currently available are from anonymous donors.
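For readers unfamiliar with the style of analysis Hadoop supports, the following is a rough plain-Python illustration of the map and reduce pattern, counting variants per chromosome. It does not use Hadoop or any AWS API. The record layout and the sample values are hypothetical, chosen only to show the shape of the computation.

```python
from collections import defaultdict
from itertools import chain

def map_variant(record):
    """Map step: emit (chromosome, 1) for one variant record."""
    chromosome, _position, _alleles = record
    yield chromosome, 1

def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Hypothetical records: (chromosome, position, alleles)
records = [
    ("chr1", 10583, "G/A"),
    ("chr1", 10611, "C/G"),
    ("chr2", 10180, "T/C"),
]

mapped = chain.from_iterable(map_variant(r) for r in records)
variants_per_chromosome = reduce_counts(mapped)
print(variants_per_chromosome)
```

On a real cluster the map step would run in parallel across many machines and the framework would group the emitted pairs by key before reducing; here both steps run in a single process.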

According to the press release, samples have already been collected from populations around the world, including Utah residents with Northern and Western European ancestry, people with Chinese heritage in Denver, people with Mexican heritage in Los Angeles, and people with African heritage in the Southwestern United States. The announcement about the 1000 Genomes Project was timed to coincide with the Big Data Summit being held at the White House, where government officials and researchers will discuss the challenges and opportunities created by big data.

