Closed
29 changes: 24 additions & 5 deletions scprint/model/utils.py
@@ -458,14 +458,33 @@ def translate(
         obj = bt.Ethnicity.df().set_index("ontology_id")
     else:
         return None
+    def _lookup(ontology_id: str) -> str:
+        """Look up a single ontology id, falling back to the raw id on miss.
+
+        CELLxGENE allows comma-concatenated ontology terms (e.g.
+        self_reported_ethnicity_ontology_term_id='HANCESTRO:0005,HANCESTRO:0008')
+        which are not themselves entries in lamindb. Split, resolve each part,
+        and rejoin the names so translation no longer crashes on such cells.
+        See https://github.com/cantinilab/scPRINT/issues/49
+        """
+        if ontology_id == "unknown":
+            return ontology_id
+        if "," in ontology_id:
+            parts = [p.strip() for p in ontology_id.split(",") if p.strip()]
+            return ",".join(_lookup(p) for p in parts)
+        try:
+            return obj.loc[ontology_id]["name"]
+        except KeyError:
+            # Unknown ontology id (not in the current lamindb instance):
+            # fall back to the raw id rather than crashing the whole call.
+            return ontology_id
+
     if type(val) is str:
-        if val == "unknown":
-            return {val: val}
-        return {val: obj.loc[val]["name"]}
+        return {val: _lookup(val)}
     elif type(val) is list or type(val) is set:
-        return {i: obj.loc[i]["name"] if i != "unknown" else i for i in set(val)}
+        return {i: _lookup(i) for i in set(val)}
     elif type(val) is dict or type(val) is Counter:
-        return {obj.loc[k]["name"] if k != "unknown" else k: v for k, v in val.items()}
+        return {_lookup(k): v for k, v in val.items()}
P2: Aggregate colliding translated keys in Counter input

When val is a dict/Counter, the comprehension {_lookup(k): v for k, v in val.items()} silently overwrites earlier entries if multiple raw IDs resolve to the same translated label (for example, comma-concatenated IDs that differ only by whitespace, or different IDs that normalize to the same output string). In those cases counts are lost instead of combined, which can skew label-frequency summaries produced from value_counts()/Counter inputs. The comprehension should accumulate values per translated key rather than keep only the last one.
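
A minimal sketch of the accumulation suggested here, not the PR author's code. The helper name translate_counts, the toy table _TOY_NAMES, and the stand-in toy_lookup are all hypothetical; toy_lookup only mimics the split/resolve/rejoin and raw-id fallback of _lookup without touching lamindb. It returns a Counter (a dict subclass), which is an assumption about what callers would accept.

```python
from collections import Counter
from typing import Callable, Mapping

# Hypothetical toy table standing in for the lamindb-backed `obj` index.
_TOY_NAMES = {"X:1": "name_a", "X:2": "name_b"}


def toy_lookup(ontology_id: str) -> str:
    """Stand-in for _lookup: split on commas, resolve parts, fall back to the raw id."""
    if "," in ontology_id:
        parts = [p.strip() for p in ontology_id.split(",") if p.strip()]
        return ",".join(toy_lookup(p) for p in parts)
    return _TOY_NAMES.get(ontology_id, ontology_id)


def translate_counts(
    counts: Mapping[str, int], lookup: Callable[[str], str]
) -> Counter:
    """Sum the values of all raw ids that resolve to the same translated label."""
    merged: Counter = Counter()
    for raw_id, count in counts.items():
        # += accumulates, where a dict comprehension would keep only the last value.
        merged[lookup(raw_id)] += count
    return merged


# Two raw keys that differ only by whitespace collapse to one label.
raw = {"X:1,X:2": 3, "X:1, X:2": 2, "X:3": 1}
merged = translate_counts(raw, toy_lookup)
naive = {toy_lookup(k): v for k, v in raw.items()}
```

With this input the naive comprehension keeps only the last count for the colliding label, while the accumulating version sums both.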




class Attention: