This page accompanies the public-facing pharos400_tdls_canonicals.csv and pharos400_tdls_full.csv releases.
It summarizes what changed relative to Pharos319,
how canonical, isoform, and alternate-product logic is assigned, and what the public TDL table contains.
Unlike legacy Pharos319, which used a largely static set of 20,412 proteins,
Pharos400 is protein-centric, provenance-tracked, and isoform-aware.
The public Pharos400 release is distributed as two complementary files:
- pharos400_tdls_canonicals.csv: 20,654 canonical proteins used for the legacy-facing Pharos319 versus Pharos400 comparison
- pharos400_tdls_full.csv: 166,833 total human UniProt protein rows spanning 20,654 canonical proteins, 22,768 isoforms, and 123,411 alternate products

In pharos400_tdls_canonicals.csv, 20,419 proteins are reviewed and 235 are unreviewed. This release includes both reviewed (Swiss-Prot) and unreviewed (TrEMBL) human UniProt accessions, after filtering out accessions affected by the resource transition that UniProt has publicly disclosed.
In practice, the Pharos400 MySQL-backed database and knowledge graph are designed to stay current with UniProt and other supporting protein data resources, rather than remaining tied to a one-time legacy protein set.
Pharos400 is not simply a larger copy of the old TDL table. For user-facing TDL comparison, the legacy
Pharos319 protein set is compared against the newly defined canonical proteins in Pharos400.
More broadly, Pharos400 labels each protein row as canonical, isoform, or alternate_product.

Target Development Levels (TDLs) summarize how much is known about a target's development and druggability. Pharos400 carries the four IDG TDL categories used by Pharos:

| TDL | Definition |
|---|---|
| Tclin | Targets with approved-drug activities in DrugCentral and a known mechanism of action. |
| Tchem | Targets with small-molecule activities in ChEMBL or DrugCentral that satisfy the activity thresholds below. Some targets may also be manually promoted to Tchem by expert curation from other small-molecule evidence. |
| Tbio | Targets without qualifying drug or small-molecule activities that satisfy one or more biological-evidence criteria, including being above the Tdark evidence cutoffs or having experimentally supported Gene Ontology Molecular Function or Biological Process leaf-term annotation. |
| Tdark | Targets without qualifying drug or small-molecule activities and with limited available evidence, defined by meeting two or more dark-target criteria: PubMed text-mining score < 5, GeneRIF count <= 3, or antibody count <= 50. |

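The decision logic in the table can be sketched in a few lines. This is a minimal illustration, not the release pipeline: the function names and boolean inputs are hypothetical, manual promotion to Tchem is not modeled, and the order in which the Tbio and Tdark rules are checked is an assumption.

```python
def count_dark_criteria(pm_score, generif_count, antibody_count):
    """Count how many of the three dark-target criteria are met."""
    return sum([
        pm_score < 5,          # PubMed text-mining score < 5
        generif_count <= 3,    # GeneRIF count <= 3
        antibody_count <= 50,  # antibody count <= 50
    ])


def assign_tdl(has_approved_drug_moa, has_qualifying_ligand,
               pm_score, generif_count, antibody_count,
               has_experimental_go_leaf=False):
    """Assign a TDL from simplified evidence flags (hypothetical inputs)."""
    if has_approved_drug_moa:
        return "Tclin"
    if has_qualifying_ligand:
        return "Tchem"
    # "Above the Tdark cutoffs" is read here as meeting fewer than two criteria.
    above_dark_cutoffs = count_dark_criteria(
        pm_score, generif_count, antibody_count) < 2
    # Assumption: experimental GO leaf-term evidence is checked before Tdark.
    if above_dark_cutoffs or has_experimental_go_leaf:
        return "Tbio"
    return "Tdark"
```

For example, a protein with no drug or ligand evidence, a text-mining score of 1.0, no GeneRIFs, and 10 antibodies meets all three dark criteria and is returned as Tdark.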
DrugCentral and ChEMBL activity values must be standardizable to -log molar units and must pass target-family-specific potency cutoffs:
| Target Family | Activity Cutoff |
|---|---|
| GPCRs | <= 100 nM |
| Kinases | <= 30 nM |
| Ion channels | <= 10 uM |
| Non-IDG family targets | <= 1 uM |
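Because activities are standardized to -log molar units, a value p corresponds to a potency of 10^(9 - p) nM, so the cutoffs above can be applied after a simple conversion. The sketch below assumes hypothetical family labels as dictionary keys and treats any unlisted family as a non-IDG target.

```python
# Cutoffs from the table above, expressed in nM.
# The dictionary keys are illustrative labels, not the pipeline's own values.
CUTOFF_NM = {
    "GPCR": 100.0,
    "Kinase": 30.0,
    "Ion channel": 10_000.0,  # 10 uM
}
DEFAULT_CUTOFF_NM = 1_000.0   # non-IDG family targets, 1 uM


def p_to_nm(p_value):
    """Convert a -log10(molar) activity value to nanomolar."""
    return 10 ** (9 - p_value)


def passes_cutoff(p_value, family):
    """Return True when the potency is at or below the family cutoff."""
    cutoff = CUTOFF_NM.get(family, DEFAULT_CUTOFF_NM)
    return p_to_nm(p_value) <= cutoff
```

Under these assumptions a value of 7.6 (about 25 nM) passes the kinase cutoff, while 7.4 (about 40 nM) does not.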
The PubMed text-mining score is a fractional count: each publication contributes a total of 1, distributed across mentioned proteins by relative mention frequency.
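A fractional count in which each paper contributes a total of 1, split across the proteins it mentions in proportion to mention frequency, can be illustrated as follows. This is one reading of the description, not the actual text-mining pipeline; the input shape (one dictionary of mention counts per paper) and the protein keys are hypothetical.

```python
from collections import defaultdict


def fractional_scores(papers):
    """Sum per-paper fractional contributions for each protein.

    `papers` is a list of dicts mapping protein -> mention count in one paper.
    """
    scores = defaultdict(float)
    for mentions in papers:
        total = sum(mentions.values())
        if total == 0:
            continue  # a paper with no protein mentions contributes nothing
        for protein, n in mentions.items():
            scores[protein] += n / total  # each paper's shares sum to 1
    return dict(scores)


example = fractional_scores([{"P1": 3, "P2": 1}, {"P1": 2}])
```

Here the first paper gives P1 a share of 0.75 and P2 a share of 0.25, and the second paper gives P1 a full 1.0, so the scores across all proteins sum to the number of papers.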
ChEMBL activities are counted for TDL support only when they have a pChEMBL value, come from a binding assay,
use a molecule with MOL structure type, are mapped to a SINGLE_PROTEIN target, have
standard_flag = 1 with an exact standard relation, are associated with a publication, and pass the family cutoff:
kinases <= 30 nM, GPCRs <= 100 nM, nuclear receptors <= 100 nM,
ion channels <= 10 uM, and other targets <= 1 uM.
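The ChEMBL conditions above amount to a conjunction of record-level checks. A minimal sketch follows, assuming each activity is a flat dictionary; the key names and the coded values ("B" for a binding assay, "=" for an exact relation) are assumptions loosely modeled on ChEMBL's schema, and `has_publication` is a hypothetical flag.

```python
def counts_for_tdl(activity, cutoff_nm):
    """Return True when one activity record meets every listed condition."""
    pchembl = activity.get("pchembl_value")
    if pchembl is None:
        return False  # no pChEMBL value, so the record is not counted
    potency_nm = 10 ** (9 - pchembl)  # pChEMBL is -log10(molar)
    return (
        activity.get("assay_type") == "B"                     # binding assay
        and activity.get("structure_type") == "MOL"           # molecule structure type
        and activity.get("target_type") == "SINGLE_PROTEIN"   # spelling as in the text above
        and activity.get("standard_flag") == 1
        and activity.get("standard_relation") == "="          # exact relation
        and bool(activity.get("has_publication"))
        and potency_nm <= cutoff_nm                            # family cutoff in nM
    )


record = {
    "pchembl_value": 8.0, "assay_type": "B", "structure_type": "MOL",
    "target_type": "SINGLE_PROTEIN", "standard_flag": 1,
    "standard_relation": "=", "has_publication": True,
}
```

With the kinase cutoff of 30 nM, `counts_for_tdl(record, 30.0)` accepts this 10 nM record; changing any single field, such as the assay type, rejects it.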
In the public-facing comparison of legacy Pharos319 proteins against the
canonical proteins in Pharos400, 2,137 proteins moved to a higher TDL category.
| Upgrade Transition | Count |
|---|---|
| Tbio -> Tchem | 1,320 |
| Tdark -> Tbio | 743 |
| Tdark -> Tchem | 46 |
| Tchem -> Tclin | 18 |
| Tbio -> Tclin | 10 |

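A transition table like this one can be derived by comparing the TDL of each shared protein across the two releases. The sketch below assumes the ordering Tdark < Tbio < Tchem < Tclin implied by the upgrade transitions, and takes two hypothetical dictionaries mapping accession to TDL; it also checks that the published counts sum to the stated total.

```python
from collections import Counter

# Assumed ordering, lowest to highest development level.
TDL_RANK = {"Tdark": 0, "Tbio": 1, "Tchem": 2, "Tclin": 3}


def upgrade_transitions(old_tdl, new_tdl):
    """Count old -> new transitions for shared proteins whose TDL rank rose."""
    counts = Counter()
    for uniprot_id, old in old_tdl.items():
        new = new_tdl.get(uniprot_id)
        if new is not None and TDL_RANK[new] > TDL_RANK[old]:
            counts[f"{old} -> {new}"] += 1
    return counts


# Published counts from the table above.
published = {
    "Tbio -> Tchem": 1320,
    "Tdark -> Tbio": 743,
    "Tdark -> Tchem": 46,
    "Tchem -> Tclin": 18,
    "Tbio -> Tclin": 10,
}
total_upgrades = sum(published.values())
```

The five published transition counts sum to 2,137, matching the number of proteins reported as moving to a higher TDL category.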
These upgrades reflect the updated Pharos400 database and knowledge graph, plus the harmonized evidence carried into the released canonical comparison set.
Pharos400 uses UniProt accessions (uniprot_id) as the public-facing protein identifier, alongside internal IFX/NCATS row identifiers (id) for release traceability.
Operationally, the pipeline downloads and harmonizes all human UniProt proteins in the UniProt database, then annotates them using Ensembl, RefSeq, NCBI Gene, and HGNC support.
The internal identifiers are intended to support reproducible Pharos400 processing and comparison, not to serve as a new external identifier standard.
For user-facing legacy comparison, the cleanest apples-to-apples view is:
- canonical_isoform_status == canonical

UniProt defines the canonical sequence as the representative sequence displayed by default for an entry and used as the coordinate basis for positional annotation. For reviewed Swiss-Prot entries, UniProt selects this representative using evidence such as function, broad expression, conserved exons, agreement with other genome curation resources such as CCDS or MANE, or, when evidence is limited, the longest sequence. UniProt generally groups protein products from one gene in one species into a single entry when possible, including isoforms from alternative splicing, alternative promoter usage, or alternative translation initiation. See the UniProt help page on canonical sequences and isoforms.

Pharos400 uses those UniProt concepts as input evidence, then applies additional cross-source grouping rules so canonical, isoform, and alternate-product rows can be represented consistently in the database and knowledge graph.
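For the legacy-comparison view, the canonical subset can be selected by filtering on canonical_isoform_status. A minimal sketch using only the standard library is shown here; the sample rows and their accessions (X1, X2, X3) are placeholders, not real data.

```python
import csv
import io


def canonical_rows(lines):
    """Return rows whose canonical_isoform_status is 'canonical'.

    `lines` is any iterable of CSV text lines, such as an open file handle.
    """
    reader = csv.DictReader(lines)
    return [row for row in reader
            if row["canonical_isoform_status"] == "canonical"]


# Placeholder rows standing in for the released file.
sample = io.StringIO(
    "uniprot_id,tdl,canonical_isoform_status\n"
    "X1,Tbio,canonical\n"
    "X2,Tbio,isoform\n"
    "X3,Tdark,alternate_product\n"
)
canonicals = canonical_rows(sample)
```

Against the released file, the same function could be called on a handle opened with `open("pharos400_tdls_full.csv", newline="")`.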
Within a uniprot_id group, the pipeline chooses the preferred representative in this priority order:
- UniProtKB reviewed (Swiss-Prot)

After per-UniProt grouping, rows can also be grouped across shared consolidated gene symbols when they represent related protein products. This is done conservatively so that empty-symbol proteins and unrelated proteins are not collapsed together.

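The empty-symbol safeguard can be illustrated with a small grouping function. This sketch covers only that one safeguard: how the pipeline decides that proteins are related is not described here and is not modeled, and the key format used for symbol-less rows is hypothetical.

```python
from collections import defaultdict


def group_by_symbol(rows):
    """Group accessions by gene symbol without merging empty-symbol rows."""
    groups = defaultdict(list)
    for row in rows:
        symbol = (row.get("symbol") or "").strip()
        # Rows without a symbol get a per-accession key so they stay separate.
        key = symbol if symbol else f"__no_symbol__:{row['uniprot_id']}"
        groups[key].append(row["uniprot_id"])
    return dict(groups)


# Placeholder rows; symbols and accessions are illustrative only.
grouped = group_by_symbol([
    {"uniprot_id": "X1", "symbol": "GENE1"},
    {"uniprot_id": "X2", "symbol": "GENE1"},
    {"uniprot_id": "X3", "symbol": ""},
    {"uniprot_id": "X4", "symbol": None},
])
```

The two GENE1 rows share a group, while X3 and X4 each remain in a group of their own.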
canonical_isoform_status:

- canonical: biologically supported canonical protein for a harmonized group
- isoform: non-canonical protein with explicit isoform evidence
- alternate_product: non-canonical grouped product without positive isoform evidence

The source of that label is recorded in canonical_isoform_status_source.

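The three labels follow a simple precedence, sketched below. Both boolean inputs are hypothetical simplifications of whatever evidence the pipeline actually evaluates.

```python
def classify_status(is_canonical, has_isoform_evidence):
    """Map two simplified evidence flags to a canonical_isoform_status label."""
    if is_canonical:
        return "canonical"
    if has_isoform_evidence:
        return "isoform"            # non-canonical with explicit isoform evidence
    return "alternate_product"      # non-canonical without positive isoform evidence
```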
- is_canonical: marks the biologically selected canonical protein in the harmonized group
- protein_group_anchor: marks the row used as the grouping or representative anchor

A row can be an anchor without being a biologically canonical protein.

Pharos400 explicitly filters UniProt accessions associated with the current UniProt resource transition:
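- 227,286 human UniProt accessions before filtering
- 60,453 accessions removed by the transition filter
- 166,833 accessions retained in the release

UniProt has publicly disclosed this transition in its proteomes and UniProtKB/TrEMBL resources.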
After removal filtering, the pipeline repairs orphaned canonical links so grouped proteins still point to a live representative.
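One way to picture the filter-then-repair step is shown below. This is a sketch under stated assumptions, not the pipeline's method: `canonical_id` is a hypothetical field holding each row's representative, and choosing the first surviving member of an orphaned group as its new representative is an arbitrary stand-in rule.

```python
def filter_and_repair(rows, removed_ids):
    """Drop removed accessions, then re-point links left without a live target."""
    live = [dict(r) for r in rows if r["uniprot_id"] not in removed_ids]
    live_ids = {r["uniprot_id"] for r in live}
    replacement = {}  # removed representative -> stand-in representative
    for row in live:
        old = row["canonical_id"]
        if old not in live_ids:
            # Assumption: the first surviving group member becomes the stand-in.
            row["canonical_id"] = replacement.setdefault(old, row["uniprot_id"])
    return live


# Placeholder rows; accessions are illustrative only.
repaired = filter_and_repair(
    [
        {"uniprot_id": "A", "canonical_id": "A"},
        {"uniprot_id": "B", "canonical_id": "A"},
        {"uniprot_id": "C", "canonical_id": "A"},
        {"uniprot_id": "D", "canonical_id": "D"},
    ],
    removed_ids={"A"},
)
```

After A is removed, B and C both point to B, and D is unchanged, so every remaining row references a live representative.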
The release metadata records the following source versions for the May 2026 build:
| Source | Version |
|---|---|
| Ensembl BioMart | 115 |
| NCBI Gene Info | 2026-05-06 |
| HGNC | 2026-05-05 |
| RefSeq (NCBI) | 2026-05-06 |
| UniProt | 2026_01 |
The public pharos400_tdls_canonicals.csv and pharos400_tdls_full.csv files contain the following columns:
| Column | Description |
|---|---|
| id | Internal IFX/NCATS row identifier used for release traceability, including isoform and alternate-product tracking. This is not intended as a public identifier standard. |
| uniprot_id | Primary UniProt accession used as the public-facing protein identifier. |
| tdl | Assigned Target Development Level: Tclin, Tchem, Tbio, or Tdark. |
| uniprot_reviewed | Whether the UniProt record is reviewed (Swiss-Prot) or unreviewed (TrEMBL). |
| uniprot_annotationScore | UniProt annotation score carried forward for the protein record. |
| name | Protein name used in the released table. |
| xref | Cross-reference summary carried forward for the protein row. |
| idg_family | IDG protein family assignment used in Pharos. |
| uniprot_function | Functional description carried forward from UniProt. |
| symbol | Gene symbol associated with the protein record. |
| ncbi_id | NCBI Gene identifier. |
| ensembl_id | Ensembl protein or transcript identifier summary. |
| canonical_isoform_status | Protein grouping state: canonical, isoform, or alternate_product. |
| uniprot_isoform | UniProt isoform identifier when available. |
| tdl_ligand_count | Ligand count used in TDL assignment. |
| tdl_drug_count | Drug count used in TDL assignment. |
| tdl_go_term_count | GO-term count used in TDL assignment. |
| tdl_generif_count | GeneRIF count used in TDL assignment. |
| tdl_pm_score | PubMed-based score used in TDL assignment. |
| tdl_antibody_count | Antibody count used in TDL assignment. |

The public-facing Pharos comparison is Pharos319 versus the canonical proteins in Pharos400.
This is the clearest bridge from the legacy release to the new harmonized release because it avoids mixing the full isoform and alternate-product expansion into a legacy-to-current TDL comparison.
- 20,412 proteins in legacy Pharos319
- 20,654 canonical proteins in Pharos400
- 20,191 shared proteins between the two releases
- 463 proteins new in Pharos400 relative to Pharos319
- 221 proteins present in Pharos319 are not retained in the Pharos400 canonical comparison set due to obsolete UniProt IDs
- 2,137 TDL upgrades relative to Pharos319

Within the 221 Pharos319 proteins not retained in the Pharos400 canonical comparison set,
117 are currently treated as truly obsolete in UniProt, 94 were reassigned
one-to-one to a replacement UniProt accession, and 10 were reassigned one-to-many.
This public comparison is based on the canonical protein subset of Pharos400.
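The shared, new, and not-retained counts correspond to plain set operations on accessions, and the published figures are internally consistent. The sketch below assumes two hypothetical collections of accessions as input and then checks the arithmetic of the reported numbers.

```python
def compare_releases(legacy_ids, canonical_ids):
    """Compare two accession sets and report overlap sizes."""
    legacy, current = set(legacy_ids), set(canonical_ids)
    return {
        "shared": len(legacy & current),
        "new": len(current - legacy),
        "not_retained": len(legacy - current),
    }


# Consistency checks on the published counts.
LEGACY, CANONICAL, SHARED, NEW, NOT_RETAINED = 20412, 20654, 20191, 463, 221
checks = {
    "legacy_minus_not_retained": LEGACY - NOT_RETAINED == SHARED,
    "canonical_minus_shared": CANONICAL - SHARED == NEW,
    "not_retained_breakdown": 117 + 94 + 10 == NOT_RETAINED,
}
```

All three checks hold: 20,412 minus 221 gives 20,191 shared proteins, 20,654 minus 20,191 gives 463 new proteins, and the obsolete, one-to-one, and one-to-many groups sum to 221.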