Skip to main content
  • Research article
  • Open access
  • Published:

Phylogenetic relationship and domain organisation of SET domain proteins of Archaeplastida

Abstract

Background

SET is a conserved protein domain with methyltransferase activity. Several genome and transcriptome data in plant lineage (Archaeplastida) are available but status of SET domain proteins in most of the plant lineage is not comprehensively analysed.

Results

In this study phylogeny and domain organisation of 506 computationally identified SET domain proteins from 16 members of plant lineage (Archaeplastida) are presented. SET domain proteins of rice and Arabidopsis are used as references. This analysis revealed conserved as well as unique features of SET domain proteins in Archaeplastida. SET domain proteins of plant lineage can be categorised into five classes- E(z), Ash, Trx, Su(var) and Orphan. Orphan class of SET proteins contain unique domains predominantly in early Archaeplastida. Contrary to previous study, this study shows first appearance of several domains like SRA on SET domain proteins in chlorophyta instead of bryophyta.

Conclusion

The present study is a framework to experimentally characterize SET domain proteins in plant lineage.

Background

Epigenetic cellular memory is heritable during mitosis and/meiosis but is not encoded in the genetic material [1, 2]. Histone post-translational modifications, DNA methylation, and non-coding RNAs and chromatin remodelling are the major components of epigenetic inheritance [3]. Numerous histone modifications like acetylation, methylation and phosphorylation occur mainly in the N-terminal tails of the histones [2, 4,5,6]. These histone modifications are associated with transcriptional repression or activation, DNA repair and DNA recombination [7]. Among various histone modifications, lysine methylation is rigorously studied in plant, fungi, and animal models [8, 9]. Lysine residues can be mono, di and tri-methylated [8]. Lysine methylation is catalyzed by 130–150 amino acid protein domain. The name SET is derived from Drosophila proteins, Su(var)3–9 [10], Enhancer of zeste, E(z) [11], and Trithorax (trx) [12] having this domain. The only exception is the H3K79 methylation mark deposited by DOT1 methyltransferase that does not contain SET domain [13].

Based on SET domain sequence similarity to Drosophila homologues, plant SET domain proteins are classified into 4 main classes [14], Enhancer of Zeste, E(z) homologs; Absent, Small, or Homeotic disks (Ash) homologs and related proteins; Trithorax (Trx) homologs and related proteins; and Suppressor of Variegation, Su(var) homologs and related proteins [14,15,16]. Su(var) and E(z) group proteins are mainly involved in transcription repression, whereas Trx and Ash group are mainly involved in transcriptional de-repression [16, 17]. A separate category is later added to include proteins with an interrupted SET domain, lowly conserved SET domain and proteins with TPR and Rubisco domains [18,19,20]. This category of proteins are not well characterized for their biochemical activity.

Here we have taken a computational approach to analyse SET domain sequences of 16 different taxa of Archaeplastida. 506 SET proteins are identified that are classified into five classes. Our work will serve as a guideline for functional and biochemical characterization of SET proteins of plant lineage.

Methods

Identification of SET domain proteins in Archaeplastida

To uncover the SET proteins in Archaeplastida, the conserved SET domain amino acid sequences from Arabidopsis thaliana (At), Oryza sativa (Os), Picea abies (Pa), Selaginella meollendorffii (Sm), Physcomitrella patens (Pp), Micromonas pusila (Mpu), Micromonas RCC299 (Mr), Ostreococcus tauri (Ot), Ostreococcus lucimerinus (Ol), Chlorella vulgaris (Cv), Chlamydomonas reinhardtii (Cr) and Volvox carteri (Vc) are retrieved by BLAST search using SET domain (PF00856) as query from PFAM database Version 30.0 (http://pfam.xfam.org/family/PF00856) [21]. The SET domain protein sequences are also retrieved from Uniprot (http://www.uniprot.org/) and Phytozome v12.1 (https://phytozome.jgi.doe.gov/pz/portal.html) that are not available in PFAM. To better understand the evolutionary history and functionality of SET domain-containing proteins in plant lineages, we have use assembled transcriptome data from, Nitella mirabilis (Nm) (https://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/bioproject/PRJNA158153 158153), Klebsormidium flaccidum (Kf) (https://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/bioproject/ PRJNA 51159). Cyanophora paradoxa (http://cyanophora.rutgers.edu/cyanophora/) and predicted proteome data for Marchantia polymorpha (Mp) (https://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/bioproject/PRJNA218052) and Gymnosperm, Picea abies (Pa) (http://marchantia.info/tom/blast/blast.html tom/blast/blast.html). In in silico proteome datasets, if two or more overlapping protein sequences arose from same genetic locus, we have selected the longest sequence. Transcript sequences obtained from NCBI website are transdecoded into predicted proteins using Transdecoder (https://transdecoder.github.io/) that gave the longest open reading frame. Local Blastp (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/. nih.gov/blast/executables/blast+/LATEST/) is carried out with these predicted proteins against 130–150 length SET domain sequences from Arabidopsis thaliana to filter the SET domain containing sequences. Thereafter, redundant SET domain amino acid sequences are removed from all the SET domain sequences obtained from the online search. The Arabidopsis thaliana and Oryza sativa SET domains protein sequences are used as queries because it is a representative of flowering plants and possess relatively complete annotated protein information.

Domain architecture

Presence or absence of specific domain and its organisation might provide an intimation to the functional divergence of the different protein groups. Additionally, it is conventional that proteins sharing identical domain may share close evolutionary relationship and may be functionally related. Predicted proteins from PFAM and local Blast were searched for the signature SET domain and their associated domain using NCBI Batch CD search tool (http://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/Structure/bwrpsb/bwrpsb.cgi/) and NCBI-CD Hit (https://ww w.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi).

Phylogenetic analysis

For phylogenetic tree construction, 251 SET domain sequences retrieved from the E(z), Ash, Trx and Su(var) group are included. SET domain containing proteins from the Orphan, SETD, and TPR are excluded due to low sequence similarity with E(z), Ash, Trx and Su(var). Multiple sequence alignments of SET domain sequences are performed using the ClustalW program. Evolutionary analyses are conducted in MEGA7.0 program [22]. The evolutionary history is inferred by using the Maximum Likelihood method based on the JTT matrix based model with a bootstrap test of 1000 replicates for internal branch reliability and use all sites option as Gaps/Missing data treatment. The tree with the highest log likelihood (−39,059.190) is shown. Initial tree(s) for the heuristic search are obtained automatically by applying Neighbour-Joining and BioNJ algorithms to a matrix of pairwise distances estimated using a JTT model, and then selecting the topology with superior log likelihood value.

Multiple alignments of SET domain sequences

Multiple sequence alignment is performed for the unannotated conserved SET domain sequences of plant lineages with reference Arabidopsis and rice SET domain sequences using multalin program (http://multalin.toulouse.inra.fr/multalin/) with the default parameters. SET domain sequences sharing significant homology with Arabidopsis and rice. SET domain sequences are categorized to the corresponding group of the SET domain. Classification is also interpreted based on neighbouring domain organization.

Nomenclature and classification of SET domain proteins

SET domain proteins are named according to the nomenclature of annotated plant and animal SET domain histone methyltransferases with generic capitalized letter followed by species name in the lower case using the common specific protein name. SET domain proteins are named according to the name of the closest Arabidopsis thaliana and Oryza sativa homolog. If two or more proteins were equally close to a single Arabidopsis thaliana and Oryza sativa SET domain containing proteins, then the same name is used followed by the letters “a”, “b”, etc. The following nomenclature is adopted for classes and subclasses of the SET domain-containing proteins: For e.g. protein type, II-2C would belong to Class II, subclass 2 and subdivision of subclass C.

Results

Identification and classification of SET domain proteins of Archaeplastida into five classes

To get an insight into the SET domain proteins from plant lineage and their probable functional relationship, we performed a homology search from 16 Archaeplastida against SET domain sequences retrieved from PFAM and NCBI transcriptome assembly. A total of 506 candidate SET domain containing protein sequences are identified (Fig. 1, Additional file 1: Table S1 and Additional file 2: Table S2). This SET domain homology search identified a range of 25–50 SET domain containing proteins in each plant species. SET proteins identified in this study range from 200 to 3500 amino acid in length. The numbers of the SET domain-containing proteins show a non-linear increment with the evolution of compelxification (Fig. 1).

Fig. 1
figure 1

Graphical representation of SET domain containing proteins in 16 representative Archaeplastida. SET domain proteins can be grouped into 5 classes. Class I, E(z); Class II, Ash; Class III, Trx; Class IV, Su(var); Class V includes Orphan, SETD, and TPR. Each class is color coded

SET domain proteins can be grouped into 5 classes. Class I as E(z), Class II as Ash, Class III as Trx and Class IV as Su(var) and Class V. Class II, III and IV each has also a subclass called Orphan that has SET domain with significantly low homology to the conserved SET domain or absence of signature domains of the respective class. Class V includes Orphan, SETD, and TPR with low sequence similarity in the SET domains to other 4 groups. SETD protein family has a characteristic Rubisco domain. TPR has tetratricopeptide domain while Orphan family have either an interrupted SET domain or have SET domain with few associated domains (Fig. 1). Classification presented here is uniform with already annotated histone methyltransferases from other plant species with few modifications. Few of the genomes considered in the present study are in draft stage, hence some proteins in our analysis may be missing or may have errors in the predicted protein structure.

The numbers of the SET domain containing proteins in unicellular glaucophyta species, Cyanophora paradoxa, (16) are about 3-fold lower compared to pteridophyta species, Selaginella meollendorffii (51). Nonetheless, small number of SET domain proteins detected in Cyanophora paradoxa may also be due to missing proteins in the currently available proteome. E(z) group of proteins is not detected in, glaucophyta and chlorophyta sp. Volvox carteri though they are present in another chlorophyta species, Chlamydomonas reinhardtii. Ash and Trx group of proteins are identified as early as in glaucophyta and there is a nonlinear increase in the number of Ash group with the increase in complexification of species. Among the four main subfamilies, Ash proteins have maintained nearly stable numbers among plant species. However, a noteworthy and surprising feature is the absence of Ash group of proteins in Micromonas pusila (Fig. 1). Trx proteins show a sudden doubling in number in charophyta, marchantiophyta, bryophyta, and pteridophyta as compared to chlorophyta, glaucophyta. Su(var) proteins are absent in glaucophyta and chlorophyta species, volvox carteri, Ostreococcus sp. and Micromonas RCC299 and then there is a gradual increase in the number among, marchantiophyta, bryophyta, and pteridophyta. Tetratricopeptide repeat (TPR) group of proteins is first identified in chlorophyta, Chlamydomonas reinhardtii with their highest abundance in pteridophytes. SETD proteins are identified in chlorophyta but there is no proportional increase in number in various species of plant lineage. The absence of SETD in glaucophyta and charophyta remains controversial. Unlikely, Orphan proteins are identified since the emergence of glaucophyta. Orphan protein shows highest abundance in Micromonas RCC299, while Marchantia polymorpha possess a minimum number of Orphan family proteins. Sequence comparisons are performed among SET domain containing proteins from other Archeaplastida members and with reference species, Arabidopsis thaliana and Oryza sativa. Increased numbers and diversification of encoded SET domain proteins in embryophyta perhaps reflect increased diversity and complexification of multicellular programs. Characterization of these domains will be useful in understanding functions of SET domain proteins.

Phylogenetic analysis

Two hundred and fifty one SET domain proteins belonging to E(z), Ash, Trx, and Su(var) are considered for phylogenetic clustering. Remaining SET domain candidate proteins from Orphan, SETD, and TPR are excluded from phylogenetic analysis due to low sequence similarity. The multiple sequence alignment of the 251 SET domain protein sequences are provided in Additional file 3: Figure S1. Based on the presence and absence of group-specific architectural motifs and sequence alignment, these 251 SET domain protein sequences are classified into E(z), Ash, Trx, and Su(var). A noteworthy point is that many Ash and Trx related proteins diverge from their respective branch (Fig. 2).

Fig. 2
figure 2

Phylogenetic tree of SET-domain proteins in Archaeplastida. The SET domain sequences of 251 proteins from 16 species are aligned using ClustalW. Phylogenetic analysis is performed using MEGA 7.0. SET domain proteins of Archaeplastida are divided into four distinct groups: E(z), Ash, Trx and Su(var)3–9 proteins. Branch nodes are colored distinctly for separate clade mentioned in the accompanying legends. The dark blue rectangle indicates E (z) group, Cyan blue triangle indicates Ash group, the Pink triangle indicates Trx group while green rectangle indicates Suv group. The evolutionary history is inferred by using the Maximum Likelihood method based on the JTT matrix-based model. The tree with the highest log likelihood (−39,445.1504) is shown. Initial trees for the heuristic search are obtained automatically by applying Neighbor-Join and BioNJ algorithms to a matrix of pairwise distances estimated using a JTT model, and then selecting the topology with superior log likelihood value

Class I-enhancer of Zeste, E(z)

E(z) is the catalytic component of Polycomb Repressive complex2 characterized in few members of plant lineage, catalysing the di- and tri-methylation of H3K27 [23]. E(z) class of proteins are named as Class I SET domain proteins in our study. Based on presence of other domains in combination of SET domain, seven variations of E(z) proteins is noted. Unlike other classes, this class of proteins appears to have strict and less diversified domain combinations. (Fig. 3). A PreSET domain is present in moncot, rice and pteridophyte, Selaginella. A vanadium binding domain is present in Physcomitrella patens and Marchantia polymorpha. Cysteine/serine-rich nuclear protein, (CSR) and Male-specific lethal (MSL) domains are Ostreococcus specific . Cyanophora and Volvox lack this class of protein. Interestingly most members in plant lineage contain only one E(z) homolog whereas Marchantia and Arabidopsis contain three and Picea contains two homologs.

Fig. 3
figure 3

Domain architecture of the E(z) family. Schematic diagrams show the domain organization of E(z) proteins. For the E(z) family, seven major representative subgroups differing by the positioning of the SET domain and type of associated domains are shown. The corresponding name of the species sharing the specific domain architecture are placed on the right side. The divergent domains are indicated by different color in the figure with the names on top. SANT: Swi3, Ada2, N-Cor, and TFIIIB; TCR:Tesmin/TSO1; Vn: Vanadium binding protein; ALDH-SF: aldehyde dehydrogenase superfamily; CSR: Cysteine/serine-rich nuclear protein; MSL: Male-specific lethal. Domains are not drawn to scale. Scale bars indicate 100 amino acids. (At- Arabidopsis thaliana; Os-Oryza sativa; Pa-Picea abies; Sm-Selaginella meollendorffii; Pp-Physcomitrella patens; Mp-Marchantia polymorpha; Nm-Nitella mirabilis; Kf-Klebsormidium flaccidium; Mr.-Micromonas RCC299; Mpu-Micromonas pusila; Ot-Ostreococcus tauri; Ol-Ostreococcus lumiferans; Cv-Chlorella vulgaris; Cr-Chlamydomonas reinhardtii; Vc-Volvox carteri; Cp-Cyanophora paradoxa)

Class II- absent, small, or Homeotic disks (ash) and related proteins

Ash class of proteins have diverse domain combinations. Based on domain combinations a novel classification of Ash proteins is presented here. Ash proteins are classified into 4 classes, Class II-1 to Class II-3 and II-Orphan (Table 1). Class II-1 and II-2 are divided into 2 subclasses whereas Class II-3 is subdivided into 3 subclasses. Except orphan class, all proteins have AWS (Associated with SET domain) suggesting its indispensible role in functions of Ash class of proteins (Fig. 4). Class II-1A has AWS domain followed by SET domain, whereas Class II-1B has PostSET domain. A chlorophyta member, Chlorella vulgaris shows the initial presence of PostSET domain in Ash family (Additional file 4: Figure S2) Class II-2 has Plant Homeo Domain (PHD) followed by AWS and SET domain. Very interestingly, Physcomitrella patens, Marchantia polymorpha, and Nitella mirabilis have three PHD domains in tandem. Most of the class II-3 proteins have zinc fingers. This domain is introduced in charophyta species, Micromonas RCC299 (Additional file 4: Figure S2). This class of proteins are likely to methylate multiple amino acids at different positions in histones.

Table 1 Classification of Class II-Ash proteins
Fig. 4
figure 4

Domain architecture of the Ash family. Schematic representation of Ash family proteins. Three major representative subgroups differing by the positioning of the SET domain and type of associated domain domains are shown along with an orphan group. These groups are further sub divided. The right side of the protein indicates the abbreviated form of the species sharing a particular protein architecture. Associated domains are indicated by different colored boxes with the names. AWS: Associated with SET; PHD: Plant homeodomain; PHD_NSD: Plant homeodomain found in nuclear receptor-binding SET domain-containing proteins; Zf-MYND: Myeloid nervy, deaf; Zf_C: Zinc binding motif composed of cysteine motif; TPR: Tetratricopeptide Repeat; Zf-CW: Zinc finger domain with conserved cysteine and tryptophan residues; TUDOR: Royal family protein; PLN03081: pentatricopeptide (PPR) repeat containing protein; PKC: Protein kinase catalytic domain; TNG2: T-cell leukemia neighbouring genes; ATP_11: Adenosine tri phosphate 11; PHA_03420: E4 protein; PHA02669: Hypothetical protein; CITED: CBP/p300-interacting transactivator with ED-rich tail, FAM196: Family of unknown function; SRI: Set2 Rbp1 interacting; Zf_R: Zn-finger in Ran binding protein and others; SoxC: Sry-related HMG box; Drf_FM1: Diaphanous related formin homology region1; Me425_SD1: Mediator complex 25 iynapsin 1; LIM: Lin11, Isl-1 & Mec-3. Domains are not drawn to scale. Scale bars indicate 100 amino acids. (At- Arabidopsis thaliana; Os-Oryza sativa; Pa-Picea abies; Sm-Selaginella meollendorffii; Pp-Physcomitrella patens; Mp-Marchantia polymorpha; Nm-Nitella mirabilis; Kf-Klebsormidium flaccidium; Mr.-Micromonas RCC299; Mpu-Micromonas pusila; Ot-Ostreococcus tauri; Ol-Ostreococcus lumiferans; Cv-Chlorella vulgaris; Cr-Chlamydomonas reinhardtii; Vc-Volvox carteri; Cp-Cyanophora paradoxa)

Class III-Trithorax Homologs and related proteins (Trx)

Based on earlier classification, Trx proteins are designated as Class III and subdivided into four subclasses, Class III-1 to Class III-4 [24]. Each subclass is further divided into subclasses-III-1A to E, III-2A to D, III-3A to F and III-4A, B and orphan (Table 2). In addition to the SET domain, Class III proteins bear several domains like Proline-Tryptophan-Tryptophan-Proline residue domain (PWWP), Phenylalanine-Tyrosine residue domain (FYR), PHD and PostSET domains (Fig. 5). PWWP, PHD, FYR domains are detected as early as in chlorophyta species, Volvox carteri (Additional file 5: Figure S3). Class III-1 in Trx uniquely has FYR and PWWP domains. The FYR domain is composed of an FYR N-terminal portion and FYR C-terminal portion that usually occurs near each other. Few species like Chlamydomonas reinhardtii, Arabidopsis thaliana and Oryza sativa have Royal Family domain (TUDOR) in this class of proteins. Variations in the number of PHD domain were noted among the subclass of Class III-1. The III-2 group usually contains one PWWP domain, two to three PHD domain, and one PostSET domain without FYR domain (Fig. 5). Class III-2A lacks the PostSET domain. A unique feature of Class III-2C is the presence of Zinc Finger (ZnF), Sp100, AIRE-1, NucP41/75, DEAF-1 (SAND) and High mobility group box (HMGb) domain along with canonical PWWP, PHD, SET and PostSET domain. Another striking feature is the presence of both AWS and PWWP domain (signature domain of Class II and Class III respectively) in a single protein type in Class III-2D. Class III-3 lacks PWWP domain and is characterized by the presence of PHD and SET domains. Subclass III-3B and III-3E have an additional PostSET domain. Class III-3D possess a single PreSET domain whereas Class III-3F possess HMG domain. Class III-3 possess additional Apetala 2 (AP2), Agnet (Royal family domain), Bromo-adjacent homology (BAH) and Malignant brain tumor (MBT) domains in addition to the canonical domains. Glaucophyta lacks all other domains except SET and PostSET. Class III-4A contains one additional Glycine-Tyrosine-Phenylalanine residue (GYF), ZnF_C2H2 and HMG domains. These domains are reported for RNA binding or single-stranded DNA binding in eukaryotic proteins [25]. Class III-4B usually lack additional domains except for a SET and a PostSET domain without an additional domain.

Table 2 Classification of Class III-Trx proteins
Fig. 5
figure 5

Domain architecture of the Trx family. Representative domain organization of the Trx proteins. Trx proteins are divided into four groups with subgrouping of each group. Right side panel indicates the abbreviated form of the species sharing the domain arrangement. PHD1: Plant homeodomain1; PHD_SF: Plant homeodomain superfamily; Ephd: extended plant homeodomain; FYRN_C: phenylalanine/tyrosine-rich domain in N-terminal and C-terminal region; Z: zinc knuckle is a zinc binding motif composed of cysteine residue motifs found mostly from retroviral gag proteins; HMG-b: High mobility group box; Bro-Bromodomain; AP2: Apetala2; TUDOR: Royal family protein; NHP6B: Chromatin-associated proteins containing the HMG domain; Agnet: Royal family domain; DUF_3839: Domain of unknown function; Jas: Jasmonate motif; BAH: Bromo-adjacent homology; MBT: Malignant brain tumor; PTZ_00368: hypothetical protein; GYF: GYF domain: contains conserved Gly-Tyr-Phe; MM_CoA: Methylmalonyl CO enzyme A; SAND: Sp100, AIRE-1, NucP41/75, DEAF-1 domain. Domains are not drawn to scale. Scale bars indicate 100 amino acids. (At- Arabidopsis thaliana; Os-Oryza sativa; Pa-Picea abies; Sm-Selaginella meollendorffii; Pp-Physcomitrella patens; Mp-Marchantia polymorpha; Nm-Nitella mirabilis; Kf-Klebsormidium flaccidium; Mr.-Micromonas RCC299; Mpu-Micromonas pusila; Ot-Ostreococcus tauri; Ol-Ostreococcus lumiferans; Cv-Chlorella vulgaris; Cr-Chlamydomonas reinhardtii; Vc-Volvox carteri; Cp-Cyanophora paradoxa)

Class IV-suppressor of variegation Homologs and relatives Su(var)3–9

Su(var) comprises of a large group of proteins characterized by the presence of PreSET, SET and PostSET domains. This class is designated here as Class IV (Fig. 6). This class is divided in to two - IV-1 & 2. IV-1 is subdivided into IV-1A to C. IV-2 is subdivided into IV-2A to G. Most members of Class IV-1 has a characteristic arrangement of SET and Ring finger Associated (SAD_SRA) domain followed by a PreSET located N terminal to the SET domain. Only exception is the class IV-1B where PreSET domain is absent in between SAD_SRA and SET domain. SAD_SRA domain binds to methylated cytosines [26]. Class IV-1B proteins are present in Chlorella vulgaris and Micromonas pusila. Class IV-C possess two SET domains located C-terminal to SAD_SRA and PreSET domain. This class of protein is present in Chlamydomonas reinhardtii. All members of class IV-2 has a PreSET domain before SET domain with the absence of SAD_SRA domain. Class IV-2C contains one or more ZnF_C2H2 domain(s) at its N-terminus and only found in Marchantia polymorpha whereas Class IV-2D contains one Ubiquitin-binding WIYLD domain (WIYLD) at its N-terminus along with PreSET and SET, found in Arabidopsis thaliana, Oryza sativa, Picea abies, and Marchantia polymorpha (Table 3). This domain organization might have arisen relatively recently in Marchantia polymorpha, belonging to embryophyta (Additional file 6: Figure S4). It is worth noting that unlike we see that SRA_SET is already present in chlorophyte contrary to the previous report of its first appearance in bryophytes [27]. Our analysis shows that PostSET domain is present only in Marchantia polymorpha and Nitella mirabilis, categorized under Class IV-2C and Class IV-2E respectively. PostSET domain might have been lost by some members during the subsequent evolution (Additional file 6: Figure S4).

Fig. 6
figure 6

Domain organisations of the Su(var)3–9 family. Su(var)3–9 family proteins are grouped into two groups and further subgroups to a total of 17 different domain combinations. Species sharing the specific domain arrangement were indicated on right-hand side. Different protein domains are colored differently as indicated. SRA: YDG-SET and ring finger associated; TPR: Tetratricopeptide repeat; Z-TRM: tRNA 2′-O-methyltransferase Trm13; WIYLD: Ubiquitin-binding WIYLD domain; DUF3574: Domain of unknown function; COG5281: Phage related minor tail protein; LaMG: Laminin G domain; UBA: Ubiquitin associated domain. Domains are not drawn to scale. Scale bars indicate 100 amino acids. (At- Arabidopsis thaliana; Os-Oryza sativa; Pa-Picea abies; Sm-Selaginella meollendorffii; Pp-Physcomitrella patens; Mp-Marchantia polymorpha; Nm-Nitella mirabilis; Kf-Klebsormidium flaccidium; Mr.-Micromonas RCC299; Mpu-Micromonas pusila; Ot-Ostreococcus tauri; Ol-Ostreococcus lumiferans; Cv-Chlorella vulgaris; Cr-Chlamydomonas reinhardtii; Vc-Volvox carteri; Cp-Cyanophora paradoxa)

Table 3 Classification of Class IV-Su(var) proteins

Class V-orphan, SETD and TPR

Class V category of proteins share a low level of sequence identity to conserved SET domain proteins, hence are grouped separately. Class V comprises of Orphan proteins, SETD and TPR. Diverse associated domains found in other groups are absent in Class V (Additional file 7: Figures S5, and Additional file 8: Figure S6). Few members also exhibit interrupted SET domain. Many unique domains in the Orphan group such as Regulator of chromosome condensation (RCC1); bHLH-MYC (Basic helix loop helix binding to MYC transcription factors); HLH (Helix loop helix); BASP1 (Brain acid soluble protein 1), Ribosomal protein L1(rp1A) and DUF4239 (Domain of unknown function) are found that are not common in other SET domain proteins [28,29,30]. SETD proteins contain a Rubisco Lysine Serine Methyl Transferase (LSMT) substrate binding domain, which allows the protein to bind to the N-terminal tails of histones H3 and H4 [31]. The SET domain of SETD also has highly divergent sequences compared to canonical SET proteins and propose to methylate non-histone targets [32, 33]. (Additional file 7: Figures S5, and Additional file 8: Figure S6). Nevertheless, the specific methylation features within the SET region are retained [31, 32]. TPR group of proteins too lack a significant homology to canonical SET domain sequence. This protein family has a unique domain of tetratricopeptide repeats of a minimum of 34 amino acids.

Discussion

In the present study, we have computationally identified 506 SET domain proteins from the published genomes and transcriptomes of 16 members of plant lineage Archaeplastida. We have performed phylogenetic analysis of SET domains and classify proteins according to their domain organisations. Published SET domain phylogenies of rice and Arabidopsis are used as template [14, 15, 19]. This classification may reflect functional relatedness and is a basis to choose candidates to experimentally elucidate functions of SET domain proteins. In future it will be interesting to see if the loss and gain of various domains in SET domain proteins presented here is relevant to growth pattern, habitat, and alteration of generation in evolutionary history of plant lineage. Introduction of PHD domain in Ash group of proteins, Zn finger C2H2 in Trx group and PostSET in Su(var) in charophytes prompts to speculate these changes in aquatic to benthic lifestyle.

Like Arabidopsis, rice and other flowering plants plant lineage appears to have relatively higher number of SET domain proteins than yeasts and animals [15, 33]. Interestingly, numbers of individual class of proteins vary dramatically in different plant groups indicating lineage specific or redundant functions of these proteins.

Plant SET domain proteins have fewer domains. It might be that in plant lineage, domains remains dispersed in larger number of proteins. Tesmin/TSO1(TCR) domain is specifically associated with E(z) class of proteins. Glaucophyte neither have E(z) proteins nor TCR in SET domain proteins.

Other interesting feature in the evolution of the E(z) protein is the presence of the Swi3, Ada2, N-Cor, and TFIIIB (SANT) domain only in rice is indicative of monocot specific function. Aldehyde dehydrogenase superfamily (ALDH-SF); Cysteine/serine-rich nuclear protein (CSR); Male-specific lethal (MSI) are biochemically uncharacterized domains associated with SET domains of E (z) members.

Ash class of proteins have varied domain combinations, therefore we have newly classified them. Except in orphans, AWS domain is very tightly associated N terminally to the SET domain with Ash class of proteins. An interesting feature is the presence of single PHD domain in class II-2B whereas three of them on a protein in class II-2A. TUDOR domain is reported to be absent from plant lineage but we confirm its presence in Ash class proteins of Selaginella meollendorffii and Marchantia polymorpha [19]. This domain is also present in Trx class of proteins.

In general Trx class of proteins have larger number of domains. This is the trend towards SET domain proteins of animals. Predominantly PWWP and PHD domains are associated with Trx class of proteins. TUDOR domain has already appeared in SET domain protein in chlorophyta.

Phylogenetic classification of Su(var)3–9 SET proteins of Archaeplastida into two classes is based on the presence or absence of SRA domains. Our analysis reveal WIYLD and Zinc Finger domains of Class IV to have originated in marchantiophyta (Marchantia polymorpha). SRA domain emerged as early as chlorophyta and PostSET domain of Su(var)3–9 SET traced back to charophytes and then lost in some members during the subsequent evolution. (Fig. 6, Additional file 6: Figure S4, Table 3). We speculate that higher number of Su(var)3–9 SET proteins in flowering plant species is due to recent duplications after their divergence from other land plants and they play role in plant development related to flowering. This is consistent with role of one of the member of this class, KRYPTONITE in flower development [34].

In context to the diversity of life forms and its connection with the evolution of epigenetic regulations, it is noteworthy that the complexification of a species in terms of multicellularity, habitat, and forms are supposedly not connected with the epigenetic regulations. This analysis suggests that diversification of the SET domain proteins may not be the sole determinant related to the different biology of the species. Another possibility might be that orthologous proteins play diverse roles in different species. Therefore, it appears that during evolution, an additional domain are acquired along with the SET domain to enable supplementary functions. Hence, we believe that the present identification and classification of SET domain proteins of Archaeplastida will accelerate biochemical characterization of these proteins.

Conclusions

We have identified and analyzed 506 SET domain proteins from 16 members of Archaeplastida through sequence homology and phylogenetic approach. We grouped these SET domain proteins into five classes- E(z), Ash, Trx, Su(var) and Orphan. Our work provided framework for experimental characterization of plant SET domain proteins.

References

  1. Russo VE, Martienssen RA, Riggs AD. Epigenetic mechanisms of gene regulation. NY, USA: Cold Spring Harbor Laboratory Press; 1996.

  2. Saze H. Epigenetic memory transmission through mitosis and meiosis in plants. Seminars in cell & developmental Biology. 2008;19(6):527-36.

  3. Goldberg AD, Allis CD, Bernstein E. Epigenetics: a landscape takes shape. Cell. 2007;128(4):635–8.

    Article  CAS  PubMed  Google Scholar 

  4. Berger SL. The complex language of chromatin regulation during transcription. Nature. 2007;447(7143):407–12.

    Article  CAS  PubMed  Google Scholar 

  5. Strahl BD, Allis CD. The language of covalent histone modifications. Nature. 2000;403(6765):41–5.

    Article  CAS  PubMed  Google Scholar 

  6. Patel DJ, Wang Z. Readout of epigenetic modifications. Annu Rev Biochem. 2013;82:81–118.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  7. Ho L, Crabtree GR. Chromatin remodelling during development. Nature. 2010;463(7280):474–84.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  8. Liu C, Lu F, Cui X, Cao X. Histone methylation in higher plants. Annu Rev Plant Biol. 2010;61:395–420.

    Article  CAS  PubMed  Google Scholar 

  9. Freitag M. Histone Methylation by SET domain proteins in the Fungi. Annu Rev Microbiol. 2017;8(71):413-39.

  10. Tschiersch B, Hofmann A, Krauss V, Dorn R, Korge G, Reuter G. The protein encoded by the Drosophila position-effect variegation suppressor gene Su (var) 3–9 combines domains of antagonistic regulators of homeotic gene complexes. EMBO J. 1994;13(16):3822.

    CAS  PubMed  PubMed Central  Google Scholar 

  11. Jones R, Gelbart W. The drosophila Polycomb-group gene enhancer of zeste contains a region with sequence similarity to trithorax. Mol Cell Biol. 1993;13(10):6357–66.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  12. Stassen MJ, Bailey D, Nelson S, Chinwalla V, Harte PJ. The drosophila trithorax proteins contain a novel variant of the nuclear receptor type DNA binding domain and an ancient conserved motif found in other chromosomal proteins. Mech Dev. 1995;52(2):209–23.

    Article  CAS  PubMed  Google Scholar 

  13. Feng Q, Wang H, Ng HH, Erdjument-Bromage H, Tempst P, Struhl K, Zhang Y. Methylation of H3-lysine 79 is mediated by a new family of HMTases without a SET domain. Curr Biol. 2002;12(12):1052–8.

    Article  CAS  PubMed  Google Scholar 

  14. Baumbusch LO, Thorstensen T, Krauss V, Fischer A, Naumann K, Assalkhou R, Schulz I, Reuter G, Aalen RB. The Arabidopsis Thaliana genome contains at least 29 active genes encoding SET domain proteins that can be assigned to four evolutionarily conserved classes. Nucleic Acids Res. 2001;29(21):4319–33.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  15. Springer NM, Napoli CA, Selinger DA, Pandey R, Cone KC, Chandler VL, Kaeppler HF, Kaeppler SM. Comparative analysis of SET domain proteins in maize and Arabidopsis reveals multiple duplications preceding the divergence of monocots and dicots. Plant Physiol. 2003;132(2):907–25.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  16. Dillon SC, Zhang X, Trievel RC, Cheng X. The SET-domain protein superfamily: protein lysine methyltransferases. Genome Biol. 2005;6(8):227.

    Article  PubMed  PubMed Central  Google Scholar 

  17. Jenuwein T, Allis CD. Translating the histone code. Science. 2001;293(5532):1074–80.

    Article  CAS  PubMed  Google Scholar 

  18. Ying Z, Mulligan RM, Janney N, Houtz RL. Rubisco small and large SubunitN-Methyltransferases Bi-and mono-functional methyltransferases that methylate the small and large subunits of Rubisco. J Biol Chem. 1999;274(51):36750–6.

    Article  CAS  PubMed  Google Scholar 

  19. Zhang L, Ma H. Complex evolutionary history and diverse domain organization of SET proteins suggest divergent regulatory interactions. New Phytol. 2012;195(1):248–63.

    Article  CAS  PubMed  Google Scholar 

  20. Ng DW, Wang T, Chandrasekharan MB, Aramayo R, Kertbundit S, Hall TC. Plant SET domain-containing proteins: structure, function and regulation. Biochim Biophysica Acta. 2007;1769(5):316–29.

    Article  CAS  Google Scholar 

  21. Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, Potter SC, Punta M, Qureshi M, Sangrador-Vegas A. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res. 2016;44(D1):D279–85.

    Article  CAS  PubMed  Google Scholar 

  22. Kumar S, Stecher G, Tamura K. MEGA7: molecular evolutionary genetics analysis version 7.0 for bigger datasets. Mol Biol Evol. 2016;33(7):1870–4.

    Article  CAS  PubMed  Google Scholar 

  23. Cao R, Wang L, Wang H, Xia L, Erdjument-Bromage H, Tempst P, Jones RS, Zhang Y. Role of histone H3 lysine 27 methylation in Polycomb-group silencing. Science. 2002;298(5595):1039–43.

    Article  CAS  PubMed  Google Scholar 

  24. Zhu X, Chen C, Wang B. Phylogenetics and evolution of Trx SET genes in fully sequenced land plants. Genome. 2012;55(4):269–80.

    Article  CAS  PubMed  Google Scholar 

  25. Freund C, Dötsch V, Nishizawa K, Reinherz EL, Wagner G. The GYF domain is a novel structural fold that is involved in lymphoid signaling through proline-rich sequences. Nat Struct Mol Biol. 1999;6(7):656–60.

    Article  CAS  Google Scholar 

  26. Arita K, Ariyoshi M, Tochio H, Nakamura Y, Shirakawa M. Recognition of hemi-methylated DNA by the SRA protein UHRF1 by a base-flipping mechanism. Nature. 2008;455(7214):818–21.

    Article  CAS  PubMed  Google Scholar 

  27. Zhu X, Ma H, Chen Z. Phylogenetics and evolution of Su (var) 3-9 SET genes in land plants: rapid diversification in structure and function. BMC Evol Biol. 2011;11(1):63.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  28. Garbarini N, Delpire E. The RCC1 domain of protein associated with Myc (PAM) interacts with and regulates KCC2. Cell Physiol Biochem. 2008;22(1–4):031–44.

    Article  CAS  Google Scholar 

  29. Huq E, Quail PH. PIF4, a phytochrome-interacting bHLH factor, functions as a negative regulator of phytochrome B signaling in Arabidopsis. EMBO J. 2002;21(10):2441–50.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  30. Hansen SF, Harholt J, Oikawa A, Scheller HV. Plant glycosyltransferases beyond CAZy: a perspective on DUF families. Front Plant Sci. 2012;3

  31. Trievel RC, Flynn EM, Houtz RL, Hurley JH. Mechanism of multiple lysine methylation by the SET domain enzyme Rubisco LSMT. Nat Struct Mol Biol. 2003;10(7):545–52.

    Article  CAS  Google Scholar 

  32. Trievel RC, Beach BM, Dirk LM, Houtz RL, Hurley JH: Structure and catalytic mechanism of a SET domain protein methyltransferase. Cell 2002, 111(1):91-103.

  33. Yadav CB, Muthamilarasan M, Anand Dangi SS, Prasad M. Comprehensive analysis of SET domain gene family in foxtail millet identifies the putative role of SiSET14 in abiotic stress tolerance. Sci Rep. 2016;6

  34. Jackson JP, Lindroth AM, Cao X, Jacobsen SE. Control of CpNpG DNA methylation by the KRYPTONITE histone H3 methyltransferase. Nature. 2002;416(6880):556–60.

    Article  CAS  PubMed  Google Scholar 

Download references

Acknowledgements

The authors wish to thank Chhavi Dawar and Divya Tej Sowpati from Centre for Cellular and Molecular Biology (CSIR), Hyderabad, India for helping in computational work.

Funding

ML is thankful to Department of Biotechnology (DBT), Govt. of India and CSIR-CCMB for financial assistance. SS acknowledges DBT India for providing post-doctoral research fellowship.

Availability of data and materials

The data sets supporting the conclusions of this article are included within its additional files.

Author information

Authors and Affiliations

Authors

Contributions

SS collected data, carried out the bioinformatics analyses, and drafted the manuscript. ML conceived and supervised the study, drafted and revised the manuscript. Both authors read and approved the final manuscript.

Corresponding authors

Correspondence to Supriya Sarma or Mukesh Lodha.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional files

Additional file 1: Table S1.

List of species considered in the present study with the corresponding number of SET domain containing protein. (PDF 10 kb)

Additional file 2: Table S2.

List of the SET domain containing proteins from species considered in the present study with their corresponding sequence id. (PDF 65 kb)

Additional file 3: Figure S1.

Multiple sequence alignment of the 251 SET domain protein sequences from E(z), Ash, Trx and Su(var) of the 16 Archeplastida species. (PDF 74 kb)

Additional file 4: Figure S2.

Introduction of the domains in the Ash SET protein in cryptogam plant lineages. The black arrow indicates the introduction of the indicated domain in the specifically mentioned Archaeplastida species. (PDF 212 kb)

Additional file 5: Figure S3.

Introduction of the domains in the Trx SET protein in plant lineages. The black arrow indicates the introduction of the indicated domain in the specifically mentioned Archaeplastida species. (PDF 221 kb)

Additional file 6: Figure S4.

Introduction of the domains in the Su(var) SET protein in plant lineages. The black arrow indicates the introduction of the indicated domain in the specifically mentioned Archaeplastida species. (PDF 85 kb)

Additional file 7: Figure S5.

Schematic diagrams showing the domain organization of Orphan proteins. (PDF 550 kb)

Additional file 8: Figure S6.

Domain architecture of Class V- A) SETD and B) TPR family. (PDF 507 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Sarma, S., Lodha, M. Phylogenetic relationship and domain organisation of SET domain proteins of Archaeplastida. BMC Plant Biol 17, 238 (2017). https://0-doi-org.brum.beds.ac.uk/10.1186/s12870-017-1177-1

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://0-doi-org.brum.beds.ac.uk/10.1186/s12870-017-1177-1

Keywords