This page accompanies the public-facing pharos400_tdls_canonicals.csv and pharos400_tdls_full.csv releases.
It summarizes what changed relative to Pharos319,
how canonical and isoform logic is assigned, and what the public TDL table contains.
Unlike legacy Pharos319, which used a largely static set of 20,412 proteins,
Pharos400 is protein-centric, provenance-tracked, and isoform-aware.
The public Pharos400 release is distributed as two complementary files:
- pharos400_tdls_canonicals.csv: 20,654 canonical proteins used for the legacy-facing Pharos319 versus Pharos400 comparison
- pharos400_tdls_full.csv: 166,833 total human UniProt protein rows spanning 20,654 canonical proteins, 22,768 isoforms, and 123,411 alternate products

In pharos400_tdls_canonicals.csv, 20,419 proteins are reviewed and 235 are unreviewed.

This release includes both reviewed (Swiss-Prot) and unreviewed (TrEMBL) human UniProt accessions after filtering for the UniProt resource transition that UniProt has publicly disclosed.
In practice, the Pharos400 model is designed to stay current with UniProt and other supporting protein data resources, rather than remaining tied to a one-time legacy protein set.
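The composition of the full file can be checked by tallying rows per canonical_isoform_status value. The following is a minimal sketch using only the Python standard library; the function name `count_by_status` and the in-memory sample rows (with placeholder accessions) are illustrative assumptions, not part of the release.

```python
import csv
import io
from collections import Counter


def count_by_status(csv_file):
    """Count rows per canonical_isoform_status value in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row["canonical_isoform_status"] for row in reader)


# Tiny in-memory stand-in for pharos400_tdls_full.csv (placeholder rows only).
sample = io.StringIO(
    "uniprot_id,canonical_isoform_status\n"
    "ACC0001,canonical\n"
    "ACC0002,isoform\n"
    "ACC0003,alternate_product\n"
    "ACC0004,alternate_product\n"
)
counts = count_by_status(sample)

# The published totals for the full file are internally consistent.
assert 20_654 + 22_768 + 123_411 == 166_833
```

To run this against the real release, pass `open("pharos400_tdls_full.csv", newline="")` instead of the sample buffer.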
Pharos400 is not simply a larger copy of the old TDL table. For user-facing TDL comparison, the legacy
Pharos319 protein set is compared against the newly defined canonical proteins in Pharos400.
More broadly, Pharos400 labels each protein row as canonical, isoform, or alternate_product.
In the public-facing comparison of legacy Pharos319 proteins against the
canonical proteins in Pharos400, 2,137 proteins moved to a higher TDL category.
| Upgrade Transition | Count |
|---|---|
| Tbio -> Tchem | 1,320 |
| Tdark -> Tbio | 743 |
| Tdark -> Tchem | 46 |
| Tchem -> Tclin | 18 |
| Tbio -> Tclin | 10 |
These upgrades reflect the updated Pharos400 protein model and the harmonized evidence carried into the released canonical comparison set.
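A transition table of this kind could be derived by comparing TDL values for accessions present in both releases. The sketch below assumes the ordering Tdark < Tbio < Tchem < Tclin, which is implied by the upgrade directions in the table; the function name `upgrade_transitions` and the placeholder accessions are hypothetical.

```python
from collections import Counter

# Assumed ordering from least to most developed, as implied by the table.
TDL_RANK = {"Tdark": 0, "Tbio": 1, "Tchem": 2, "Tclin": 3}


def upgrade_transitions(legacy, current):
    """Count 'old -> new' transitions for accessions whose TDL rank increased.

    Both arguments map accession -> TDL; only shared accessions are compared.
    """
    transitions = Counter()
    for accession, old_tdl in legacy.items():
        new_tdl = current.get(accession)
        if new_tdl is None:
            continue  # accession not present in the newer release
        if TDL_RANK[new_tdl] > TDL_RANK[old_tdl]:
            transitions[f"{old_tdl} -> {new_tdl}"] += 1
    return transitions


# Placeholder data, not real Pharos assignments.
legacy = {"ACC1": "Tdark", "ACC2": "Tbio", "ACC3": "Tchem", "ACC4": "Tbio"}
current = {"ACC1": "Tbio", "ACC2": "Tchem", "ACC3": "Tchem", "ACC5": "Tclin"}
upgrades = upgrade_transitions(legacy, current)

# The per-transition counts in the table sum to the reported total.
assert 1_320 + 743 + 46 + 18 + 10 == 2_137
```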
Pharos400 uses:
Operationally, the pipeline downloads and harmonizes all human UniProt proteins in the UniProt database, then annotates them using Ensembl, RefSeq, NCBI Gene, and HGNC support.
For user-facing legacy comparison, the cleanest apples-to-apples view is:
- canonical_isoform_status == canonical

Within a uniprot_id group, the pipeline chooses the preferred representative in this priority order:

- UniProtKB reviewed (Swiss-Prot)

After per-UniProt grouping, rows can also be grouped across shared consolidated gene symbols when they represent related protein products. This is done conservatively so that empty-symbol proteins and unrelated proteins are not collapsed together.
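One way to picture the empty-symbol guard is to key rows without a symbol on their own accession so they can never merge. This is a minimal sketch under that assumption, not the pipeline's actual logic: the key scheme and the function name `group_by_symbol` are hypothetical, and the sketch does not attempt the additional check that keeps unrelated proteins apart.

```python
from collections import defaultdict


def group_by_symbol(rows):
    """Group accessions by gene symbol; empty-symbol rows stay in singleton groups."""
    groups = defaultdict(list)
    for row in rows:
        symbol = (row.get("symbol") or "").strip()
        # Hypothetical key scheme: rows without a symbol key on their accession.
        key = ("symbol", symbol) if symbol else ("accession", row["uniprot_id"])
        groups[key].append(row["uniprot_id"])
    return dict(groups)


# Placeholder rows, not real Pharos records.
rows = [
    {"uniprot_id": "ACC1", "symbol": "GENE1"},
    {"uniprot_id": "ACC2", "symbol": "GENE1"},
    {"uniprot_id": "ACC3", "symbol": ""},
    {"uniprot_id": "ACC4", "symbol": ""},
]
groups = group_by_symbol(rows)
```

Here the two GENE1 rows share a group, while the two empty-symbol rows remain separate.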
canonical_isoform_status:

- canonical: biologically supported canonical protein for a harmonized group
- isoform: non-canonical protein with explicit isoform evidence
- alternate_product: non-canonical grouped product without positive isoform evidence

The source of that label is recorded in canonical_isoform_status_source.
- is_canonical: marks the biologically selected canonical protein in the harmonized group
- protein_group_anchor: marks the row used as the grouping or representative anchor

A row can be an anchor without being a biologically canonical protein.
Pharos400 explicitly filters UniProt accessions associated with the current UniProt resource transition:
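- 227,286 human UniProt accessions before filtering
- 60,453 accessions removed by the transition filter
- 166,833 accessions retained

UniProt has publicly disclosed this transition in its proteomes and UniProtKB/TrEMBL resources.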
After removal filtering, the pipeline repairs orphaned canonical links so grouped proteins still point to a live representative.
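The filter-then-repair step can be illustrated as follows. This is a sketch under stated assumptions: the field name `canonical_id`, the function name `filter_and_repair`, and the rule that the first surviving group member becomes the new representative are all hypothetical, since the release notes do not specify how the replacement is chosen.

```python
def filter_and_repair(rows, removed):
    """Drop removed accessions, then re-point orphaned canonical links.

    rows: list of dicts with 'uniprot_id' and 'canonical_id' (hypothetical names).
    removed: set of accessions to filter out.
    """
    live = [dict(r) for r in rows if r["uniprot_id"] not in removed]
    live_ids = {r["uniprot_id"] for r in live}
    replacement = {}  # removed canonical -> surviving representative
    for r in live:
        target = r["canonical_id"]
        if target not in live_ids:
            # Assumption: the first surviving member becomes the representative.
            r["canonical_id"] = replacement.setdefault(target, r["uniprot_id"])
    return live


# Placeholder rows, not real Pharos records.
rows = [
    {"uniprot_id": "ACC1", "canonical_id": "ACC1"},
    {"uniprot_id": "ACC2", "canonical_id": "ACC1"},
    {"uniprot_id": "ACC3", "canonical_id": "ACC1"},
    {"uniprot_id": "ACC4", "canonical_id": "ACC4"},
]
repaired = filter_and_repair(rows, removed={"ACC1"})

# The three figures reported for the transition filter are consistent.
assert 227_286 - 60_453 == 166_833
```

After the repair, every surviving row points at an accession that is still present in the filtered set.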
The release metadata records the following source versions for the May 2026 build:
| Source | Version |
|---|---|
| Ensembl BioMart | 115 |
| NCBI Gene Info | 2026-05-06 |
| HGNC | 2026-05-05 |
| RefSeq (NCBI) | 2026-05-06 |
| UniProt | 2026_01 |
The public pharos400_tdls_canonicals.csv and pharos400_tdls_full.csv files contain the following columns:
| Column | Description |
|---|---|
| id | Stable NCATS protein identifier for the released protein row. |
| uniprot_id | Primary UniProt accession used as the public-facing protein identifier. |
| tdl | Assigned Target Development Level: Tclin, Tchem, Tbio, or Tdark. |
| uniprot_reviewed | Whether the UniProt record is reviewed (Swiss-Prot) or unreviewed (TrEMBL). |
| uniprot_annotationScore | UniProt annotation score carried forward for the protein record. |
| name | Protein name used in the released table. |
| xref | Cross-reference summary carried forward for the protein row. |
| idg_family | IDG protein family assignment used in Pharos. |
| uniprot_function | Functional description carried forward from UniProt. |
| symbol | Gene symbol associated with the protein record. |
| ncbi_id | Supporting NCBI Gene identifier. |
| ensembl_id | Supporting Ensembl protein or transcript identifier summary. |
| canonical_isoform_status | Protein grouping state: canonical, isoform, or alternate_product. |
| uniprot_isoform | UniProt isoform identifier when available. |
| tdl_ligand_count | Supporting ligand count used in TDL assignment. |
| tdl_drug_count | Supporting drug count used in TDL assignment. |
| tdl_go_term_count | Supporting GO-term count used in TDL assignment. |
| tdl_generif_count | Supporting GeneRIF count used in TDL assignment. |
| tdl_pm_score | Supporting PubMed-based score used in TDL assignment. |
| tdl_antibody_count | Supporting antibody count used in TDL assignment. |
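A reader of either file might first confirm that the documented columns are present and then keep only canonical rows. The sketch below uses the standard library `csv` module; the function name `load_canonical_rows` and the sample rows are illustrative assumptions, and the check tests column presence only because the column order in the released files is not stated here.

```python
import csv
import io

# Column names as documented in the table above.
EXPECTED_COLUMNS = [
    "id", "uniprot_id", "tdl", "uniprot_reviewed", "uniprot_annotationScore",
    "name", "xref", "idg_family", "uniprot_function", "symbol", "ncbi_id",
    "ensembl_id", "canonical_isoform_status", "uniprot_isoform",
    "tdl_ligand_count", "tdl_drug_count", "tdl_go_term_count",
    "tdl_generif_count", "tdl_pm_score", "tdl_antibody_count",
]
VALID_TDLS = {"Tclin", "Tchem", "Tbio", "Tdark"}


def load_canonical_rows(csv_file):
    """Validate the header and TDL values, then return only canonical rows."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [c for c in EXPECTED_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for row in reader:
        if row["tdl"] not in VALID_TDLS:
            raise ValueError(f"unexpected tdl value: {row['tdl']!r}")
        if row["canonical_isoform_status"] == "canonical":
            rows.append(row)
    return rows


# Build a two-row placeholder file in memory; unspecified fields are left blank.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=EXPECTED_COLUMNS, restval="")
writer.writeheader()
writer.writerow({"uniprot_id": "ACC1", "tdl": "Tbio",
                 "canonical_isoform_status": "canonical"})
writer.writerow({"uniprot_id": "ACC1-2", "tdl": "Tbio",
                 "canonical_isoform_status": "isoform"})
buffer.seek(0)
canonical_rows = load_canonical_rows(buffer)
```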
The public-facing Pharos comparison is Pharos319 versus the canonical proteins in Pharos400.
This is the clearest bridge from the legacy release to the new harmonized release because it avoids mixing the full isoform and alternate-product expansion into a legacy-to-current TDL comparison.
- 20,412 proteins in legacy Pharos319
- 20,654 canonical proteins in Pharos400
- 20,191 shared proteins between the two releases
- 463 proteins new in Pharos400 relative to Pharos319
- 221 proteins present in Pharos319 are not retained in the Pharos400 canonical comparison set due to obsolete UniProt IDs
- 2,137 TDL upgrades relative to Pharos319

Within the 221 Pharos319 proteins not retained in the Pharos400 canonical comparison set,
117 are currently treated as truly obsolete in UniProt, 94 were reassigned
one-to-one to a replacement UniProt accession, and 10 were reassigned one-to-many.
This public comparison is based on the canonical protein subset of Pharos400.
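The shared, new, and not-retained counts correspond to simple set operations on the accession lists of the two releases. The following is a minimal sketch; the function name `compare_releases` and the placeholder accessions are assumptions, and the closing assertions only confirm that the reported figures agree with one another.

```python
def compare_releases(legacy_ids, current_ids):
    """Split accessions into shared, new, and not-retained sets."""
    legacy, current = set(legacy_ids), set(current_ids)
    return {
        "shared": legacy & current,
        "new": current - legacy,
        "not_retained": legacy - current,
    }


# Placeholder accessions, not real Pharos identifiers.
result = compare_releases(
    ["ACC1", "ACC2", "ACC3"],
    ["ACC2", "ACC3", "ACC4", "ACC5"],
)

# Reported figures: shared + new = Pharos400 canonicals,
# shared + not retained = Pharos319, and the 221 breakdown sums correctly.
assert 20_191 + 463 == 20_654
assert 20_191 + 221 == 20_412
assert 117 + 94 + 10 == 221
```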