Systematic evaluation of the impact of promoter proximal short tandem repeats on expression



Abstract

Genetic variation at thousands of short tandem repeats (STRs), which consist of consecutive repeated sequences of 1-6 bp, has been statistically associated with gene expression and other molecular phenotypes in humans. However, the causality and regulatory mechanisms for most of these STRs remain unknown. Massively parallel reporter assays (MPRAs) enable testing the regulatory activity of a large number of synthesized variants, but have not been applied to STRs due to experimental and computational challenges. Here, we optimized an MPRA framework based on random barcoding to study the impact of variation in repeat copy number on expression. We first performed an MPRA on sequences derived from 30,516 promoter-proximal STR loci along with up to 152 bp of genomic context, testing 3-4 variants with differing repeat copy numbers for each locus in HEK293T cells. We identified 1,366 loci with significant associations between repeat copy number and expression, which were enriched for positive effect sizes (P=2.08e-110). We then designed a second MPRA in which we performed deeper perturbations, including systematic manipulation of the repeat unit sequence, orientation, and copy number, with 200-300 perturbations for each of the 300 loci with the strongest signals. Our results revealed that the repeat unit sequence is the primary driver of differences in the relationship between copy number and expression across loci, whereas orientation and flanking sequence have weaker effects, primarily for AT-rich repeat units. The high resolution of these perturbations enabled us to detect non-linear effects, most notably for AAAC/GTTT repeats, which emerge only beyond a certain copy number threshold. Finally, we observed that a subset of STRs in our library show expression levels that are tightly linked with predicted DNA secondary structure formation.
We repeated our perturbation MPRA in HeLa S3 cells under wild-type and RNase H1 knockdown conditions, which, via reduction in RNase H1 activity, are expected to hinder resolution of R-loops. This demonstrated that associations between copy number and expression at G-quadruplex-forming CCCCG/CGGGG repeats are particularly sensitive to loss of RNase H1, providing support for an R-loop-mediated mechanism for these repeats. Altogether, we establish STRs as a critical component of the non-coding regulatory grammar and provide a framework for understanding how this dynamic form of genetic variation shapes gene expression.
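
As a rough illustration of the design described in the abstract, the sketch below builds copy-number variants of a single repeat locus, shows why repeat units are written as strand pairs such as AAAC/GTTT, and computes a simple least-squares slope of expression against copy number. The flanking sequences, copy numbers, and expression values are hypothetical placeholders, and the ordinary least-squares slope is an assumption made for illustration; the abstract does not state which statistical model the authors used.

```python
# Illustrative sketch only: flanks, copy numbers, and expression values are
# hypothetical, and the least-squares slope is an assumed stand-in for the
# association test, which the abstract does not specify.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def copy_number_variants(left_flank: str, unit: str, right_flank: str,
                         copy_numbers: list[int]) -> dict[int, str]:
    """Build one sequence per copy number: left flank + unit * n + right flank."""
    return {n: left_flank + unit * n + right_flank for n in copy_numbers}


def least_squares_slope(xs: list[float], ys: list[float]) -> float:
    """Slope of the ordinary least-squares line of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # The same repeat read from the opposite strand: AAAC pairs with GTTT.
    print(reverse_complement("AAAC"))
    print(reverse_complement("CCCCG"))

    # Hypothetical locus with three tested copy numbers.
    variants = copy_number_variants("AC", "AAAC", "GT", [2, 3, 4])
    for n, seq in variants.items():
        print(n, seq)

    # Hypothetical expression values for copy numbers 5, 10, and 15.
    print(least_squares_slope([5, 10, 15], [1.0, 2.0, 3.0]))
```

A positive slope here corresponds to the positive effect sizes mentioned in the abstract, where expression increases with repeat copy number. A single linear slope would not capture the threshold-like, non-linear behavior reported for AAAC/GTTT repeats.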
