Pharos400 Release Notes and TDL Updates

This page accompanies the public-facing pharos400_tdls_canonicals.csv and pharos400_tdls_full.csv releases. It summarizes what changed relative to Pharos319, how canonical, isoform, and alternate-product logic is assigned, and what the public TDL table contains.

Total human protein rows: 166,833
Canonical proteins: 20,654
Isoform + alternate-product rows: 146,179

Release Summary

Unlike legacy Pharos319, which used a largely static set of 20,412 proteins, Pharos400 is protein-centric, provenance-tracked, and isoform-aware.

The public Pharos400 release is distributed as two complementary files: pharos400_tdls_canonicals.csv and pharos400_tdls_full.csv.

This release includes both reviewed (Swiss-Prot) and unreviewed (TrEMBL) human UniProt accessions, after filtering out accessions affected by the resource transition that UniProt has publicly disclosed.

In practice, the Pharos400 MySQL-backed database and knowledge graph are designed to stay current with UniProt and other supporting protein data resources, rather than remaining tied to a one-time legacy protein set.

What Changed Compared to Pharos319

Pharos400 is not simply a larger copy of the old TDL table. For user-facing TDL comparison, the legacy Pharos319 protein set is compared against the newly defined canonical proteins in Pharos400. More broadly, Pharos400:

For legacy users, the biggest conceptual change is that Pharos400 is built from a full reviewed and unreviewed human UniProt protein download, with harmonized grouping logic layered on top.

Target Development Level Definitions

Target Development Levels (TDLs) summarize how much is known about a target's development and druggability. Pharos400 carries the four IDG TDL categories used by Pharos:

TDL    Definition
Tclin  Targets with approved-drug activities in DrugCentral and a known mechanism of action.
Tchem  Targets with small-molecule activities in ChEMBL or DrugCentral that satisfy the activity thresholds below. Some targets may also be manually promoted to Tchem by expert curation from other small-molecule evidence.
Tbio   Targets without qualifying drug or small-molecule activities that satisfy one or more biological-evidence criteria, including being above the Tdark evidence cutoffs or having experimentally supported Gene Ontology Molecular Function or Biological Process leaf-term annotation.
Tdark  Targets without qualifying drug or small-molecule activities and with limited available evidence, defined by meeting two or more dark-target criteria: PubMed text-mining score < 5, GeneRIF count <= 3, or antibody count <= 50.
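The definitions above can be read as a decision cascade. The following is a minimal sketch of that reading, not the released pipeline: the function name and argument names are hypothetical, and it assumes that drug_count already counts only approved-drug activities with a known mechanism of action, that ligand_count counts only activities passing the potency thresholds, and that go_term_count counts only experimentally supported Molecular Function or Biological Process leaf terms. It also assumes GO evidence is enough for Tbio even when the dark-target criteria are met, and it does not model manual Tchem promotion.

```python
def assign_tdl(drug_count, ligand_count, go_term_count,
               pm_score, generif_count, antibody_count):
    """Return a TDL label from per-target evidence (illustrative sketch only)."""
    if drug_count > 0:
        return "Tclin"
    if ligand_count > 0:
        return "Tchem"
    # Count how many of the three dark-target criteria are met.
    dark_hits = sum([pm_score < 5, generif_count <= 3, antibody_count <= 50])
    # Assumption: experimental GO evidence overrides the dark criteria.
    if dark_hits >= 2 and go_term_count == 0:
        return "Tdark"
    return "Tbio"
```

Under these assumptions, a target with no drugs or ligands, a text-mining score of 1.2, 2 GeneRIFs, and 10 antibodies is Tdark unless it also has a qualifying GO term.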

Activity thresholds

DrugCentral and ChEMBL activity values must be standardizable to -log molar units and must pass target-family-specific potency cutoffs:

Target Family           Activity Cutoff
GPCRs                   <= 100 nM
Kinases                 <= 30 nM
Nuclear receptors       <= 100 nM
Ion channels            <= 10 uM
Non-IDG family targets  <= 1 uM
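Because activities are standardized to -log molar units while the cutoffs are quoted as concentrations, it can help to see the conversion spelled out. This is a small sketch with hypothetical names (CUTOFF_NM, nm_to_p, passes_cutoff) and hypothetical family keys; the nuclear receptor value is taken from the ChEMBL selection rule stated on this page.

```python
import math

# Potency cutoffs in nM, keyed by hypothetical family labels.
CUTOFF_NM = {
    "GPCR": 100.0,
    "Kinase": 30.0,
    "Nuclear receptor": 100.0,
    "Ion channel": 10_000.0,   # 10 uM
    "Non-IDG": 1_000.0,        # 1 uM
}

def nm_to_p(nanomolar):
    """Convert a concentration in nM to -log10 of the molar concentration."""
    return -math.log10(nanomolar * 1e-9)

def passes_cutoff(family, p_activity):
    """A higher -log molar value means a more potent activity."""
    return p_activity >= nm_to_p(CUTOFF_NM[family])
```

On this scale the 30 nM kinase cutoff corresponds to roughly 7.52, 100 nM to 7.0, 1 uM to 6.0, and 10 uM to 5.0.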

Supporting evidence fields

ChEMBL ligand activity selection

ChEMBL activities are counted for TDL support only when they have a pChEMBL value, come from a binding assay, use a molecule with MOL structure type, are mapped to a SINGLE_PROTEIN target, have standard_flag = 1 with an exact standard relation, are associated with a publication, and pass the family cutoff: kinases <= 30 nM, GPCRs <= 100 nM, nuclear receptors <= 100 nM, ion channels <= 10 uM, and other targets <= 1 uM.
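The selection rule above amounts to a conjunction of record-level checks followed by a potency test. The sketch below illustrates that shape only: the dictionary keys, the family labels, and the function name are assumptions chosen to mirror the wording of the rule, not the actual pipeline schema or ChEMBL column names.

```python
# Family cutoffs in nM, as listed in the rule above (keys are hypothetical).
FAMILY_CUTOFF_NM = {
    "kinase": 30.0,
    "gpcr": 100.0,
    "nuclear_receptor": 100.0,
    "ion_channel": 10_000.0,
    "other": 1_000.0,
}

def counts_for_tdl(activity, family):
    """Return True if one activity record would count as TDL support."""
    pchembl = activity.get("pchembl_value")
    if pchembl is None:
        return False
    checks = (
        activity.get("assay_type") == "B",                # binding assay
        activity.get("structure_type") == "MOL",
        activity.get("target_type") == "SINGLE_PROTEIN",
        activity.get("standard_flag") == 1,
        activity.get("standard_relation") == "=",         # exact relation
        bool(activity.get("publication_id")),             # has a publication
    )
    if not all(checks):
        return False
    cutoff_nm = FAMILY_CUTOFF_NM.get(family, FAMILY_CUTOFF_NM["other"])
    potency_nm = 10 ** (9 - pchembl)  # pChEMBL is -log10 of the molar value
    return potency_nm <= cutoff_nm
```

A record with a pChEMBL value of 8.0 (10 nM) that passes every record-level check would count for a kinase, while the same record with a value of 7.0 (100 nM) would not.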

TDL Upgrades Relative to Pharos319

In the public-facing comparison of legacy Pharos319 proteins against the canonical proteins in Pharos400, 2,137 proteins moved to a higher TDL category.

Upgrade Transition  Count
Tbio -> Tchem       1,320
Tdark -> Tbio       743
Tdark -> Tchem      46
Tchem -> Tclin      18
Tbio -> Tclin       10

These upgrades reflect the updated Pharos400 database and knowledge graph, plus the harmonized evidence carried into the released canonical comparison set.
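For readers who want to reproduce a transition table like this from two accession-to-TDL mappings, the following is a minimal sketch. The function name, the ranking dictionary, and the assumption that both releases are keyed by the same accession are illustrative; accessions that were reassigned or retired would need separate handling. The final lines simply confirm that the published counts add up to the stated total of 2,137.

```python
from collections import Counter

# Ordering of TDL categories from least to most developed.
TDL_RANK = {"Tdark": 0, "Tbio": 1, "Tchem": 2, "Tclin": 3}

def count_upgrades(old_tdl, new_tdl):
    """Count (old, new) transitions where the new TDL ranks higher."""
    upgrades = Counter()
    for accession, old in old_tdl.items():
        new = new_tdl.get(accession)
        if new is not None and TDL_RANK[new] > TDL_RANK[old]:
            upgrades[(old, new)] += 1
    return upgrades

# Published counts from the table above.
PUBLISHED = {
    ("Tbio", "Tchem"): 1320,
    ("Tdark", "Tbio"): 743,
    ("Tdark", "Tchem"): 46,
    ("Tchem", "Tclin"): 18,
    ("Tbio", "Tclin"): 10,
}
total_upgrades = sum(PUBLISHED.values())  # 2137
```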

Identifier Standard

Pharos400 uses:

Operationally, the pipeline downloads and harmonizes all human proteins in UniProt, then annotates them with supporting data from Ensembl, RefSeq, NCBI Gene, and HGNC.

IFX identifier disclaimer: IFX/NCATS row IDs are included for internal traceability across this release, especially where isoforms or alternate products share the same uniprot_id. They are intended to support reproducible Pharos400 processing and comparison, not to serve as a new external identifier standard.

For user-facing legacy comparison, the cleanest apples-to-apples view is:

Canonical, Isoform, and Alternate Product Logic

UniProt defines the canonical sequence as the representative sequence displayed by default for an entry and used as the coordinate basis for positional annotation. For reviewed Swiss-Prot entries, UniProt selects this representative using evidence such as function, broad expression, conserved exons, agreement with other genome curation resources such as CCDS or MANE, or, when evidence is limited, the longest sequence. UniProt generally groups protein products from one gene in one species into a single entry when possible, including isoforms from alternative splicing, alternative promoter usage, or alternative translation initiation. See the UniProt help page on canonical sequences and isoforms.

Pharos400 uses those UniProt concepts as input evidence, then applies additional cross-source grouping rules so canonical, isoform, and alternate-product rows can be represented consistently in the database and knowledge graph.

1. Canonical representative selection

Within a uniprot_id group, the pipeline chooses the preferred representative in this priority order:

  1. UniProt canonical isoform flag
  2. reviewed UniProt entry (UniProtKB/Swiss-Prot)
  3. Ensembl canonical peptide evidence
  4. deterministic fallback on annotation and mapping scores
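One common way to express a strict priority order like this is a tuple sort key, where earlier elements dominate later ones. The sketch below assumes hypothetical row fields (is_uniprot_canonical, is_reviewed, has_ensembl_canonical, annotation_score, mapping_score, row_id); the final tie-break on row_id is an added assumption to keep the choice deterministic, and the real fallback scoring may differ.

```python
def representative_key(row):
    """Sort key: smaller tuples are preferred; False sorts before True."""
    return (
        not row["is_uniprot_canonical"],   # 1. UniProt canonical isoform flag
        not row["is_reviewed"],            # 2. reviewed (Swiss-Prot) entry
        not row["has_ensembl_canonical"],  # 3. Ensembl canonical peptide
        -row["annotation_score"],          # 4. fallback: higher scores first
        -row["mapping_score"],
        row["row_id"],                     # assumed final tie-break
    )

def pick_representative(rows):
    """Choose the preferred representative within one uniprot_id group."""
    return min(rows, key=representative_key)
```

Because tuples compare element by element, a row carrying the UniProt canonical flag always wins over one that lacks it, regardless of the later criteria.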

2. Symbol-group harmonization

After per-UniProt grouping, rows can also be grouped across shared consolidated gene symbols when they represent related protein products. This is done conservatively so that empty-symbol proteins and unrelated proteins are not collapsed together.
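The empty-symbol guard described above can be illustrated with a short sketch. It covers only that guard: rows with no symbol keep their own per-accession group instead of being pooled under a shared blank key. The function name and key scheme are hypothetical, and the additional check that rows sharing a symbol are actually related protein products is not modeled here.

```python
from collections import defaultdict

def group_by_symbol(rows):
    """Group rows by gene symbol without merging empty-symbol rows."""
    groups = defaultdict(list)
    for row in rows:
        symbol = (row.get("symbol") or "").strip()
        if symbol:
            key = ("symbol", symbol)
        else:
            # Fall back to the accession so blank symbols never collapse.
            key = ("uniprot", row["uniprot_id"])
        groups[key].append(row)
    return dict(groups)
```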

3. canonical_isoform_status

The source of the canonical_isoform_status label is recorded in canonical_isoform_status_source.

4. Important distinction: canonical vs anchor

A row can be an anchor without being a biologically canonical protein.

Removal-List Filtering

Pharos400 explicitly filters UniProt accessions associated with the current UniProt resource transition:

UniProt has publicly disclosed this transition in its proteomes and UniProtKB/TrEMBL resources.

After removal filtering, the pipeline repairs orphaned canonical links so grouped proteins still point to a live representative.
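As a rough illustration of what repairing orphaned links involves, the sketch below drops removed rows and re-points any surviving row whose representative was removed. Everything here is hypothetical: the function name, the row-to-representative mapping, and especially the replacement rule (the smallest surviving row id in the group), which stands in for whatever selection logic the pipeline actually applies.

```python
def repair_canonical_links(canonical_of, removed):
    """canonical_of maps row id -> id of its canonical representative."""
    # Keep only rows that survive removal filtering.
    live = {row: rep for row, rep in canonical_of.items() if row not in removed}
    # Pick one surviving replacement per removed representative.
    replacement = {}
    for row in sorted(live):
        old_rep = live[row]
        if old_rep in removed and old_rep not in replacement:
            replacement[old_rep] = row
    # Re-point orphaned rows; untouched groups keep their representative.
    return {row: replacement.get(rep, rep) for row, rep in live.items()}
```

After the repair, every remaining row points to a representative that is itself still present.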

Source Versions

The release metadata records the following source versions for the May 2026 build:

Source            Version
Ensembl BioMart   115
NCBI Gene Info    2026-05-06
HGNC              2026-05-05
RefSeq (NCBI)     2026-05-06
UniProt           2026_01

Data Dictionary

The public pharos400_tdls_canonicals.csv and pharos400_tdls_full.csv files contain the following columns:

Column                    Description
id                        Internal IFX/NCATS row identifier used for release traceability, including isoform and alternate-product tracking. This is not intended as a public identifier standard.
uniprot_id                Primary UniProt accession used as the public-facing protein identifier.
tdl                       Assigned Target Development Level: Tclin, Tchem, Tbio, or Tdark.
uniprot_reviewed          Whether the UniProt record is reviewed (Swiss-Prot) or unreviewed (TrEMBL).
uniprot_annotationScore   UniProt annotation score carried forward for the protein record.
name                      Protein name used in the released table.
xref                      Cross-reference summary carried forward for the protein row.
idg_family                IDG protein family assignment used in Pharos.
uniprot_function          Functional description carried forward from UniProt.
symbol                    Gene symbol associated with the protein record.
ncbi_id                   NCBI Gene identifier.
ensembl_id                Ensembl protein or transcript identifier summary.
canonical_isoform_status  Protein grouping state: canonical, isoform, or alternate_product.
uniprot_isoform           UniProt isoform identifier when available.
tdl_ligand_count          Ligand count used in TDL assignment.
tdl_drug_count            Drug count used in TDL assignment.
tdl_go_term_count         GO-term count used in TDL assignment.
tdl_generif_count         GeneRIF count used in TDL assignment.
tdl_pm_score              PubMed-based score used in TDL assignment.
tdl_antibody_count        Antibody count used in TDL assignment.
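A quick way to sanity-check a download against the columns above is to tally rows by canonical_isoform_status and canonical rows by tdl. The sketch below uses only the standard library and assumes the files are comma-delimited with a header row matching these column names; the sample rows and accessions are made-up placeholders, not release data.

```python
import csv
import io
from collections import Counter

def summarize(csv_file):
    """Tally rows by grouping state, and canonical rows by TDL."""
    status_counts = Counter()
    tdl_counts = Counter()
    for row in csv.DictReader(csv_file):
        status = row["canonical_isoform_status"]
        status_counts[status] += 1
        if status == "canonical":
            tdl_counts[row["tdl"]] += 1
    return status_counts, tdl_counts

# Tiny made-up sample with placeholder accessions.
SAMPLE = (
    "id,uniprot_id,tdl,canonical_isoform_status\n"
    "1,ACC0001,Tbio,canonical\n"
    "2,ACC0001,Tbio,isoform\n"
    "3,ACC0002,Tdark,canonical\n"
    "4,ACC0003,Tdark,alternate_product\n"
)
status_counts, tdl_counts = summarize(io.StringIO(SAMPLE))
```

To run it on a real download, pass an open file handle instead of the sample, for example the result of open("pharos400_tdls_full.csv", newline="").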

Public Release Comparison

The public-facing Pharos comparison is Pharos319 versus the canonical proteins in Pharos400.

This is the clearest bridge from the legacy release to the new harmonized release because it avoids mixing the full isoform and alternate-product expansion into a legacy-to-current TDL comparison.

Of the 221 Pharos319 proteins not retained in the Pharos400 canonical comparison set, 117 are currently treated as obsolete in UniProt, 94 were reassigned one-to-one to a replacement UniProt accession, and 10 were reassigned one-to-many.

This public comparison is based on the canonical protein subset of Pharos400.