genetics system snapshot · 2026-04-15

how the genetics pipeline actually works right now

thesis: the pipeline is no longer chaos. raw provenance, run contracts, graph schema, and agent guardrails all exist. but the system is not closed yet: fresh truth has already partly moved into `runs/` and `state/ledgers`, while `graph/` and the review loop have not caught up.

raw tier 23andme + WGS VCF

graph nodes 63 valid

run folders 8 total

canonical complete 5 / 8

system layers

not one pipeline but 6 connected layers. the good parts are already under machine contracts. the weak spot is not SNP extraction but the last bridge: from genetics outputs to operational truth and the review cycle.

9/10 strong

raw + provenance

23andme raw + tellmegene VCF allowlisted
FASTQ is stored but still marked `needs_integrity_check`
preflight rejects sample-id mismatches and VCFs from other genomes

7/10 good but uneven

runs / execution

8 run folders, 5 complete canonical runs
energy_v2 and atlas are partially filled
the stage2 lipid folder is still outside the full contract

6/10 partial canon

graph / schema

63 nodes valid, 0 schema errors
4 intents, 20 axes, 39 genes
no lipid/lpa/pharma nodes, although run truth already exists

8/10 strong

agent control

manifest allowlist + excluded genome guard
`CLAIM_TYPES.md` curbs overclaiming
the machine-first v2 contract is already strict

5/10 bridge weak

state integration

pharmacogenomics ledger is updated and usable
next_data already closes some of the old genetics questions
no general run → ledger bridge contract yet

3/10 open gap

review / protocol loop

protocol_diff files exist
`review` nodes and the generated protocol layer are empty
a weekly revision loop is planned but not institutionalized

end-to-end flow

this is the actual current path if you look at the repo without self-hypnosis. green = works. amber = works, but not everywhere. red = architecturally promised but not closed at the system level.

implemented

1. raw intake

`raw/23andme_Matskevich.txt`
`raw/tellmegene_ULZEDCBC3541.vcf.gz`
manifest allowlist

implemented

2. provenance gate

`preflight_provenance_check.rb`
sample-id check
unexpected VCF rejection
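
a minimal sketch of the gate above, in Python. the real check lives in `preflight_provenance_check.rb` and this is not a port of it: the function name, the allowlist shape, and the sample-id values are all hypothetical.

```python
from pathlib import Path


class ProvenanceError(Exception):
    """Raised when the raw tier does not match the manifest allowlist."""


def preflight(raw_files, allowlist, expected_sample_id, observed_sample_ids):
    """Fail fast on an unexpected VCF or a sample-id mismatch.

    raw_files: iterable of paths found in the raw tier
    allowlist: set of allowed file names (assumed manifest shape)
    observed_sample_ids: mapping of file name -> sample id read from that file
    """
    for path in map(Path, raw_files):
        name = path.name
        is_vcf = name.endswith((".vcf", ".vcf.gz"))
        # Unexpected VCF rejection: any VCF not named in the allowlist.
        if is_vcf and name not in allowlist:
            raise ProvenanceError(f"unexpected VCF: {name}")
        # Sample-id check: only applied to files whose sample id is known.
        sample = observed_sample_ids.get(name)
        if sample is not None and sample != expected_sample_id:
            raise ProvenanceError(f"sample mismatch in {name}: {sample}")
    return True
```

raising instead of returning a status keeps the gate fail-fast: nothing downstream runs on a dirty raw tier.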

mostly live

3. run artifacts

`00_goal` → `05_protocol_diff`
full in 5 runs
partial in atlas / energy_v2 / stage2
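
a hedged sketch of how "full" vs "partial" could be checked per run folder. it assumes `00_goal` → `05_protocol_diff` means six stages with numeric prefixes 00 to 05; only the first and last stage names appear in this snapshot, so the middle stages are matched by prefix only. function names are hypothetical.

```python
import re
from pathlib import Path

# Assumption: six stages, identified by a two-digit prefix.
EXPECTED_PREFIXES = ("00", "01", "02", "03", "04", "05")


def missing_stages(run_dir):
    """Return the stage prefixes that have no artifact in run_dir."""
    present = set()
    for entry in Path(run_dir).iterdir():
        match = re.match(r"(\d{2})_", entry.name)
        if match:
            present.add(match.group(1))
    return [p for p in EXPECTED_PREFIXES if p not in present]


def is_canonical_complete(run_dir):
    """A run counts as complete only when every expected stage is present."""
    return not missing_stages(run_dir)
```

returning the missing prefixes, not just a boolean, makes the partial runs self-describing: a run without a protocol diff would report `["05"]`.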

partial canon

4. graph sync

energy + iron + skin covered
lipid pharma not synced
`evidence/protocol/review` empty

manual bridge

5. human/state surface

views exist, but there are few of them
protocols and ledgers partly updated
state beats stale summaries

not institutionalized

6. review loop

rollback conditions written
review nodes absent
no canonical run→review automation
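
one possible shape for the missing run → review automation, as a sketch only. the record fields and function name are hypothetical, and the 7-day default is an assumption borrowed from the planned weekly revision loop, not an existing rule.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class RunRecord:
    name: str
    completed_on: date
    last_reviewed_on: Optional[date] = None  # None = never reviewed


def review_debt(runs, today, max_age=timedelta(days=7)):
    """Return names of runs whose latest review (or completion) is too old."""
    debt = []
    for run in runs:
        # A run that was never reviewed ages from its completion date.
        anchor = run.last_reviewed_on or run.completed_on
        if today - anchor > max_age:
            debt.append(run.name)
    return debt
```

this covers staleness by age only; a phenotype-mismatch trigger would need inputs from `state` that are not modeled here.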

coverage matrix

where each domain actually stands. the critical fault line: lipid-pharmacogenomics is already decision-grade in runs and state, but the formal graph canon has not caught up there.

domain	raw / ingest	run layer	graph layer	state / ledger	protocol / review	read
skin / matrix remodeling	legacy genes + WGS pass	builder-backed	axis + 7 genes	indirect only	view exists, review absent	best proof-of-concept for graph/view flow
energy v1 / v2	strong raw coverage	v1 complete, v2 near-complete	main graph backbone	not promoted as dedicated ledger	protocol diff yes, review no	most mature genetics-native domain
iron system	genetics + cross-domain intent	complete genetics run	intent draft	bridged into iron ledger only partially	protocol hypotheses exist	good mechanism layer, phenotype closure still open
full potential atlas	broad rsid extraction	missing protocol diff	not graph-synced	not operationalized	no formal review loop	good map, not yet a closed machine system
lipid / lpa / pharmacogenomics	strong raw + region annotation	v1 and v2 complete	graph gap	state ledger updated	protocol diff yes, review node no	truth outran architecture; this is the main structural debt
legacy per-gene / topics / maps	8 genes, 2 topics, 3 maps	ingress only	partly referenced	manual use	drift risk	valuable evidence base, but no longer canonical by policy

implemented vs weak vs missing

the most useful part: not "everything is bad / everything is good", but where machinery already exists, where there is semantic drift, and where there is simply a hole.

implemented

what is actually built already

manifest allowlist and provenance preflight
schema validator over 63 graph nodes
machine-first run contract and claim taxonomy
build scripts for iron, atlas, lipid pgx, skin axis
decision-grade tier from 23andme + tellmegene VCF
pharmacogenomics conclusions already promoted into `state/ledgers`

weak

what works but breaks consistency

graph is declared canon but does not cover all active domains
some runs are incomplete, and one stage2 folder is not canonical at all
generated views are almost absent: effectively there is one right now
legacy evidence is alive and useful, but drift checking is almost entirely manual
state integration happens ad hoc rather than through a formal bridge

missing

what is essentially missing

`graph/evidence`, `graph/protocols`, `graph/reviews` nodes
automatic run → graph → state promotion path
review scheduler / revalidation discipline
formal forensic escalation path FASTQ → BAM/CRAM/QC
single dashboard that always reflects latest genetics truth

agent layer

for agents the system is already much better than the usual "dig through markdown". but agents still lack one hard bridge: when a genetics run MUST change the graph and when it MUST change `state`.

agents already know how

what an agent can do today

start from the manifest and leave other people's genomes untouched
fail fast if provenance is dirty
write machine artifacts first, narrative second
separate `raw_fact`, `risk_inference`, `protocol_hypothesis`
keep human-first gene folders as evidence, not as canon
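
the claim separation above could be enforced mechanically, roughly like this. a sketch only: it covers just the three types named here (`CLAIM_TYPES.md` may define more), and the record shape with `type` and `text` keys is hypothetical.

```python
from enum import Enum


class ClaimType(Enum):
    RAW_FACT = "raw_fact"
    RISK_INFERENCE = "risk_inference"
    PROTOCOL_HYPOTHESIS = "protocol_hypothesis"


def parse_claim(record):
    """Return (ClaimType, text) or raise ValueError for an untyped claim."""
    try:
        claim_type = ClaimType(record["type"])
    except (KeyError, ValueError) as exc:
        # A missing or unknown type is rejected rather than guessed.
        raise ValueError(f"claim has no valid type: {record!r}") from exc
    return claim_type, record["text"]
```

rejecting untyped claims at parse time is one way to keep a `protocol_hypothesis` from being read as a `raw_fact` downstream.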

agents still need this

what an agent still lacks

mandatory sync rule: new run completed → update graph or fail
mandatory bridge rule: decision changes → update `state/ledgers` or mark debt
review debt tracker: what has gone stale by age / phenotype mismatch
single query surface where lipid pharma truth is read from the graph rather than from scattered runs + ledgers
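
the two mandatory rules in this list can be sketched as a single close-out check. this is a proposal-shaped illustration, not existing tooling: the function, its boolean inputs, and the debt-log format are hypothetical.

```python
class GraphSyncError(RuntimeError):
    """A completed run was not reflected in the graph."""


def close_run(run_name, graph_synced, decision_changed, ledger_updated, debt_log):
    """Apply both rules when a run completes.

    Returns "closed" or "closed_with_debt"; raises if the graph was not synced.
    """
    # Sync rule: new run completed -> update graph or fail.
    if not graph_synced:
        raise GraphSyncError(f"{run_name}: graph not updated")
    # Bridge rule: decision changes -> update state/ledgers or mark debt.
    if decision_changed and not ledger_updated:
        debt_log.append(f"{run_name}: ledger update pending")
        return "closed_with_debt"
    return "closed"
```

the asymmetry is deliberate in this sketch: a missing graph sync is a hard failure, while a missing ledger update is allowed but leaves an explicit debt entry.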

next frontier

the frontier splits in two. genetics-internal frontier = finish building the pipeline. decision frontier = stop overrating genetics where next_data has already said "phenotype/state matters more now".

genetics-internal frontier

what to finish in the pipeline

graph sync for lipid / lpa / pharmacogenomics and atlas
introduce `evidence`, `protocol`, `review` node types in production, not just on paper
close incomplete runs: `energy_v2`, `full_potential_atlas`, `lipid_pharmacogenomics_stage2`
formal run → state promotion contract
FASTQ integrity + forensic escalation only when ambiguity merits it

decision frontier

where genetics is NO LONGER the main bottleneck

main lipid-pharma genetics ambiguity substantially closed
future agents should not reopen ABCG2 panic or hidden-FH row-count fear
higher-leverage open loops now live in phenotype/state: clean home BP, sleep-airway closure, iron-copper closure
so the next leverage = less new SNP mining, more work on state + measurement architecture

source anchors

this view is built on top of current contracts and live state, not on old summaries.

genetics canon

`genetics/PIPELINE_agent_machine_first_v2.md`
`genetics/_agent/MANIFEST.yaml`
`genetics/_schema/validate_graph.py`
`genetics/runs/README.md`
`genetics/CLAIM_TYPES.md`

state + decision context

`state/ledgers/pharmacogenomics.yaml`
`state/ledgers/lipids.yaml`
`state/next_data.yaml`
`state/agent_doctrine.md`
`state/INGEST_CONTRACT.md`