Pharos400 Release Notes and TDL Updates

This page accompanies the public-facing pharos400_tdls_canonicals.csv and pharos400_tdls_full.csv releases. It summarizes what changed relative to Pharos319, how canonical and isoform logic is assigned, and what the public TDL table contains.

Total Human Protein Rows             166,833
Canonical Proteins                    20,654
Isoform + Alternate Product Rows     146,179

Release Summary

Unlike legacy Pharos319, which used a largely static set of 20,412 proteins, Pharos400 is protein-centric, provenance-tracked, and isoform-aware.

The public Pharos400 release is distributed as two complementary files, pharos400_tdls_canonicals.csv and pharos400_tdls_full.csv.

This release includes both reviewed (Swiss-Prot) and unreviewed (TrEMBL) human UniProt accessions, after filtering out accessions affected by the resource transition that UniProt has publicly disclosed.

In practice, the Pharos400 model is designed to stay current with UniProt and other supporting protein data resources, rather than remaining tied to a one-time legacy protein set.

What Changed Compared to Pharos319

Pharos400 is not simply a larger copy of the old TDL table. For user-facing TDL comparison, the legacy Pharos319 protein set is compared against the newly defined canonical proteins in Pharos400. More broadly, Pharos400:

For legacy users, the biggest conceptual change is that Pharos400 is built from a full reviewed and unreviewed human UniProt protein download, with harmonized grouping logic layered on top.

TDL Upgrades Relative to Pharos319

In the public-facing comparison of legacy Pharos319 proteins against the canonical proteins in Pharos400, 2,137 proteins moved to a higher TDL category.

Upgrade Transition    Count
Tbio -> Tchem         1,320
Tdark -> Tbio           743
Tdark -> Tchem           46
Tchem -> Tclin           18
Tbio -> Tclin            10

These upgrades reflect the updated Pharos400 protein model and the harmonized evidence carried into the released canonical comparison set.
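As a rough illustration of how an upgrade tally like the one above could be derived, the sketch below ranks the four TDL categories and counts transitions to a higher rank. The dictionaries old_tdl and new_tdl and the function count_upgrades are hypothetical names introduced here; this is not the pipeline's own code. The final check only confirms that the reported per-transition counts sum to the stated total of 2,137.

```python
from collections import Counter

# TDL categories ordered from least to most developed.
TDL_RANK = {"Tdark": 0, "Tbio": 1, "Tchem": 2, "Tclin": 3}


def count_upgrades(old_tdl, new_tdl):
    """Count (old, new) transitions where a protein moved to a higher TDL.

    old_tdl / new_tdl are hypothetical dicts mapping a protein identifier
    to its TDL in Pharos319 and in the Pharos400 canonical set.
    """
    transitions = Counter()
    for protein_id, old in old_tdl.items():
        new = new_tdl.get(protein_id)
        if new is None:
            continue  # protein not retained in the comparison set
        if TDL_RANK[new] > TDL_RANK[old]:
            transitions[(old, new)] += 1
    return transitions


# Per-transition counts as reported in the release notes.
REPORTED_UPGRADES = {
    ("Tbio", "Tchem"): 1320,
    ("Tdark", "Tbio"): 743,
    ("Tdark", "Tchem"): 46,
    ("Tchem", "Tclin"): 18,
    ("Tbio", "Tclin"): 10,
}
assert sum(REPORTED_UPGRADES.values()) == 2137
```

Every reported transition moves up the ranking, which is why a simple rank comparison is enough to separate upgrades from unchanged or downgraded proteins.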

Identifier Standard

Pharos400 uses:

Operationally, the pipeline downloads and harmonizes all human proteins in UniProt, then annotates them with supporting data from Ensembl, RefSeq, NCBI Gene, and HGNC.

For user-facing legacy comparison, the cleanest apples-to-apples view is:

Canonical, Isoform, and Alternate Product Logic

1. Canonical representative selection

Within a uniprot_id group, the pipeline chooses the preferred representative in this priority order:

  1. UniProt canonical isoform flag
  2. reviewed UniProt entry (UniProtKB/Swiss-Prot)
  3. Ensembl canonical peptide evidence
  4. deterministic fallback on annotation and mapping scores
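The priority order above can be sketched as a sort key. The ProteinRow fields, the relative order of annotation score versus mapping score in step 4, and the final accession tie-break are all assumptions made for illustration; the notes do not specify them.

```python
from dataclasses import dataclass


@dataclass
class ProteinRow:
    # Hypothetical fields; the real pipeline's schema is not documented here.
    accession: str
    uniprot_canonical: bool   # UniProt canonical isoform flag
    reviewed: bool            # Swiss-Prot (True) vs TrEMBL (False)
    ensembl_canonical: bool   # Ensembl canonical peptide evidence
    annotation_score: float
    mapping_score: float


def selection_key(row):
    """Build a tuple that sorts the preferred representative first.

    False sorts before True, so each boolean is negated; scores are
    negated so that higher scores win. The accession is an assumed
    final tie-break that keeps the choice deterministic.
    """
    return (
        not row.uniprot_canonical,
        not row.reviewed,
        not row.ensembl_canonical,
        -row.annotation_score,
        -row.mapping_score,
        row.accession,
    )


def pick_representative(rows):
    """Return the preferred row within one uniprot_id group."""
    return min(rows, key=selection_key)
```

Because tuples compare element by element, a later criterion is only consulted when every earlier criterion ties, which matches the "priority order" wording.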

2. Symbol-group harmonization

After per-UniProt grouping, rows can also be grouped across shared consolidated gene symbols when they represent related protein products. This is done conservatively so that empty-symbol proteins and unrelated proteins are not collapsed together.
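A minimal sketch of the empty-symbol safeguard, assuming rows arrive as (uniprot_id, symbol) pairs. It only shows how empty-symbol proteins can be kept out of shared groups; how the pipeline decides that two proteins with the same symbol are actually related is not described in these notes and is not modeled here.

```python
from collections import defaultdict


def group_by_symbol(rows):
    """Group accessions by gene symbol without merging empty-symbol rows.

    rows is a hypothetical iterable of (uniprot_id, symbol) pairs.
    Rows with a missing or blank symbol each stay in their own group.
    """
    groups = defaultdict(list)
    singletons = []
    for uniprot_id, symbol in rows:
        key = (symbol or "").strip()
        if not key:
            singletons.append([uniprot_id])  # never collapse empty symbols
        else:
            groups[key].append(uniprot_id)
    return list(groups.values()) + singletons
```

Treating the empty string as an ordinary grouping key would silently merge every unannotated protein into one group, which is the failure mode the conservative rule avoids.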

3. canonical_isoform_status

Each row is labeled canonical, isoform, or alternate_product, and the source of that label is recorded in canonical_isoform_status_source.

4. Important distinction: canonical vs anchor

A row can be an anchor without being a biologically canonical protein.

Removal-List Filtering

Pharos400 explicitly filters UniProt accessions associated with the current UniProt resource transition:

UniProt has publicly disclosed this transition in its proteomes and UniProtKB/TrEMBL resources.

After removal filtering, the pipeline repairs orphaned canonical links so grouped proteins still point to a live representative.
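The filter-then-repair step might look roughly like the following. The canonical_of mapping, the removed set, and the use of the smallest surviving accession as a stand-in representative are assumptions for illustration; the actual pipeline presumably re-applies its own selection logic when a representative is removed.

```python
def filter_and_repair(canonical_of, removed):
    """Drop removed accessions, then re-point orphaned canonical links.

    canonical_of: hypothetical dict mapping each accession to the
        accession of its group's representative (a representative
        maps to itself).
    removed: set of accessions on the removal list.
    """
    # Step 1: drop every accession on the removal list.
    live = {acc: rep for acc, rep in canonical_of.items() if acc not in removed}

    # Step 2: collect surviving members under their original representative.
    members = {}
    for acc, rep in live.items():
        members.setdefault(rep, []).append(acc)

    # Step 3: keep the representative if it survived; otherwise promote a
    # surviving member (placeholder rule: smallest accession).
    repaired = {}
    for rep, accs in members.items():
        new_rep = rep if rep in live else min(accs)
        for acc in accs:
            repaired[acc] = new_rep
    return repaired
```

After this step every remaining accession points at a representative that is itself still present in the table.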

Source Versions

The release metadata records the following source versions for the May 2026 build:

Source             Version
Ensembl BioMart    115
NCBI Gene Info     2026-05-06
HGNC               2026-05-05
RefSeq (NCBI)      2026-05-06
UniProt            2026_01

Data Dictionary

The public pharos400_tdls_canonicals.csv and pharos400_tdls_full.csv files contain the following columns:

Column                      Description
id                          Stable NCATS protein identifier for the released protein row.
uniprot_id                  Primary UniProt accession used as the public-facing protein identifier.
tdl                         Assigned Target Development Level: Tclin, Tchem, Tbio, or Tdark.
uniprot_reviewed            Whether the UniProt record is reviewed (Swiss-Prot) or unreviewed (TrEMBL).
uniprot_annotationScore     UniProt annotation score carried forward for the protein record.
name                        Protein name used in the released table.
xref                        Cross-reference summary carried forward for the protein row.
idg_family                  IDG protein family assignment used in Pharos.
uniprot_function            Functional description carried forward from UniProt.
symbol                      Gene symbol associated with the protein record.
ncbi_id                     Supporting NCBI Gene identifier.
ensembl_id                  Supporting Ensembl protein or transcript identifier summary.
canonical_isoform_status    Protein grouping state: canonical, isoform, or alternate_product.
uniprot_isoform             UniProt isoform identifier when available.
tdl_ligand_count            Supporting ligand count used in TDL assignment.
tdl_drug_count              Supporting drug count used in TDL assignment.
tdl_go_term_count           Supporting GO-term count used in TDL assignment.
tdl_generif_count           Supporting GeneRIF count used in TDL assignment.
tdl_pm_score                Supporting PubMed-based score used in TDL assignment.
tdl_antibody_count          Supporting antibody count used in TDL assignment.
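A short sketch of reading a file with these columns, using only the Python standard library. It assumes the CSV has a header row with the column names listed above; the three sample rows are invented placeholders, not values from the release, and the summarize function is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Invented placeholder rows using a subset of the documented columns.
SAMPLE = """id,uniprot_id,tdl,symbol,canonical_isoform_status
1,X00001,Tbio,GENE1,canonical
2,X00001-2,Tbio,GENE1,isoform
3,X00002,Tdark,,alternate_product
"""


def summarize(handle):
    """Tally grouping states and the TDL breakdown of canonical rows."""
    status_counts = Counter()
    canonical_tdl = Counter()
    for row in csv.DictReader(handle):
        status = row["canonical_isoform_status"]
        status_counts[status] += 1
        if status == "canonical":
            canonical_tdl[row["tdl"]] += 1
    return status_counts, canonical_tdl


status_counts, canonical_tdl = summarize(io.StringIO(SAMPLE))
```

To run the same tally on a downloaded release file, the in-memory sample can be swapped for a real handle, for example `with open("pharos400_tdls_full.csv", newline="") as fh: summarize(fh)`.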

Public Release Comparison

The public-facing Pharos comparison is Pharos319 versus the canonical proteins in Pharos400.

This is the clearest bridge from the legacy release to the new harmonized release because it avoids mixing the full isoform and alternate-product expansion into a legacy-to-current TDL comparison.

Within the 221 Pharos319 proteins not retained in the Pharos400 canonical comparison set, 117 are currently treated as truly obsolete in UniProt, 94 were reassigned one-to-one to a replacement UniProt accession, and 10 were reassigned one-to-many.
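The three-way breakdown above can be expressed as a simple classification on the number of replacement accessions. The replacements mapping and the function name are hypothetical; the closing check only confirms that the reported counts add up to the 221 proteins not retained.

```python
def classify_missing(replacements):
    """Classify legacy accessions by how many replacements they have.

    replacements: hypothetical dict mapping a legacy Pharos319 accession
        to a list of replacement UniProt accessions (empty if none).
    """
    counts = {"obsolete": 0, "one_to_one": 0, "one_to_many": 0}
    for targets in replacements.values():
        n = len(set(targets))  # ignore duplicate targets
        if n == 0:
            counts["obsolete"] += 1
        elif n == 1:
            counts["one_to_one"] += 1
        else:
            counts["one_to_many"] += 1
    return counts


# Counts as reported in the release notes.
REPORTED_MISSING = {"obsolete": 117, "one_to_one": 94, "one_to_many": 10}
assert sum(REPORTED_MISSING.values()) == 221
```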

This public comparison is based on the canonical protein subset of Pharos400.