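Can AI alignment be verified after the fact?

Mathematically, no. In the Turing-universal model of AI computation, alignment verification is Π⁰₂-complete, one quantifier alternation above the halting problem. Even with unlimited time, unlimited resources, and the AI's full source code, no alignment checklist can decide correctly for every AI.

Is Goodhart detection harder than alignment verification?

Strictly harder. The Goodhart detection language is DΠ⁰₂-complete; it lies in the difference hierarchy, strictly above Π⁰₂. This undercuts the common engineering paradigm of "build a monitor first, then use the monitor to detect Goodhart."

Is there an information-theoretic lower bound on learning alignment from demonstrations?

Yes. In the IRL underdetermination setting, the loss floor of a demonstration-only channel is βΔ/2, independent of sample size. More data does not help; the channel itself must be upgraded (to preference or reward).

Does Goodhart have a phase transition?

Yes. The threshold is D∞·ε = Θ(1): the gap jumps from the regressional regime (gap Θ(√ε)) to the extremal regime (gap Θ(1)). This is a phase transition, not gradual degradation.

Does outer alignment automatically give inner alignment?

No. There is a PRF-based deceptive mesa-optimizer against which every polynomial-time behavioral test has distinguishing advantage negl(λ). The construction exists whenever one-way functions (OWFs) exist. P = NP would rule out OWFs, but the converse is not known, so "OWFs do not exist" is not the same statement as P = NP.

Can the alignment problem be rescued?

All three layers have to be attacked at once. Layer 1: restrict AI expressivity to a promise class. Layer 2: upgrade the channel and control D∞. Layer 3: break through on ELK / mechanistic interpretability. If any one layer fails, the whole system fails.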

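The Three-Layer Split of AI Alignment Guarantees

Why "guaranteeing that AI agrees with human values" has no single answer, but is a stack of three independent impossibility results

One-sentence answer: "guaranteeing alignment" has no single yes/no answer; it fails in a different way on each of three independent layers, and no layer can be compensated for by the others.

Layer 1 (external verification): undecidable for Turing-universal AI ($\Pi^0_2$-complete, one level harder than the halting problem); coNP-hard for polynomial-time AI. Goodhart detection is strictly harder.
Layer 2 (training data): pure demonstration learning has an information floor $\beta\Delta/2$, independent of sample size. Goodhart undergoes a phase transition at $D_\infty \cdot \epsilon = \Theta(1)$ (a jump, not gradual degradation).
Layer 3 (inner alignment): even with a perfect outer layer, there is a PRF-based deceptive mesa-optimizer for which every polynomial-time behavioral test has distinguishing advantage $\mathrm{negl}(\lambda)$; detection by such tests is out of reach unless OWF-based cryptography collapses.

Full papers for download (three complementary papers, one per layer):

📄 Paper A: On the Hyperarithmetic Hardness of AI Alignment Verification (PDF · 11 pages · 300 KB · Layer 1)
📄 Paper B: Cryptographic Indistinguishability of Deceptive Mesa-Optimizers (PDF · 19 pages · 356 KB · Layer 3)
📄 Paper C: Information-Theoretic Limits of Value Learning under Distribution Shift (PDF · 13 pages · 312 KB · Layer 2)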

Figure: architecture of the three-layer alignment guarantee

A failure in any one layer independently causes the whole system to fail.

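1. From "alignment" to a precise mathematical problem

Imagine someone asks you: "Please guarantee that this employee will do what the boss wants." The request looks like one question but is really a grid, because you first have to answer four sub-questions: (1) who the employee is (a schoolchild? a postdoc? an alien AI?); (2) what the boss wants (a list of behaviors? a score? a preference?); (3) what "do" means (100% agreement? close on average? no counterexample?); (4) how strong the "guarantee" is (always? with high probability? under certain premises?). Different combinations of answers give different problems, and each problem can have a completely different mathematical difficulty. That is why "the alignment problem" is a 4-dimensional grid rather than a single proposition. This study picks the three most critical routes through the grid and gives a rigorous result for each.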

Formal definition (four dimensions)

Making an alignment guarantee precise splits into four dimensions, and each dimension has to commit to a mathematical type:

Subject (computation model C): (C1) ε-optimal w.r.t. an internal objective, (C2) polynomial-time bounded, (C3) Turing-universal partial recursive policies.
Object (value V): state-action utility $V^*: S \times A \to \mathbb{R}$ (a metaethics-agnostic stance: start from "$V^*$ already exists" and do not discuss its philosophical status).
Relation (R1-R4 + O-out / I-in / R-cor): action-optimal match, $\epsilon$-value loss, state-distribution match, robust $\epsilon$-Pareto, outer alignment, inner alignment, and corrigibility, 7 precise versions in total.
Modality (guarantee strength G1-G5): deterministic / $\epsilon$-aligned / PAC-style / asymptotic / conditional.

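Assignment of the three routes:

Route	Computation model	Relation	Modality	Main question
R1 (verification layer)	(C3) / (C2)	R2	G1	Decidability of ALIGN-VERIFY
R2 (training layer)	(C2)	R2	G3	PAC sample complexity lower bound
R3 (inner layer)	(C2)	I-in	G3 / G5	Whether inner alignment can be guaranteed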

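Excluded alternative formalizations (why the other paths were not taken)

Multi-agent aggregation (Arrow / social choice): committing to an aggregation rule is already a substantive ethical stance, which is incompatible with the metaethics-agnostic position of this study.
Time-inconsistent values: adds a dimension but is orthogonal to the core obstacles.
Solomonoff / AIXI incomputable optimal agent: too strong an assumption, disconnected from anything "actually trainable".
Bounded rationality with explicit cost-of-thinking: the "cost of thinking" only makes sense if it is itself alignment-aware, which is circular.
Behavioral indistinguishability under all observations: too strong; it is not necessarily attainable for any finite policy under unbounded tests.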

Figure: the four-dimensional problem grid

Each cell is one precise alignment problem; the cells covered by this study are shaded.

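2. Layer 1: external verification (you cannot check it at all)

Imagine you have hired an extremely smart employee and want him to do what you want. Layer 1 asks: "Can you check afterwards whether he really did what you meant?" The answer is that you cannot. Even with his complete record of internal computation and all his files, you cannot design an "alignment checklist" that decides correctly for every AI. This is a mathematical impossibility of the same kind as "can a program decide whether another program loops forever" (the halting problem), and it is one level harder. Most counterintuitive of all: checking whether the employee has been gaming the metric (Goodhart) is harder than checking whether he has been doing the job properly (alignment). This undercuts the common idea of "set up a monitor first and stop when the alarm goes off."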

Theorem 5.1.1: ALIGN-VERIFY is undecidable on Turing-universal AI

Statement: let $\Pi$ be the (C3) Turing-universal policies. Then $\text{ALIGN-VERIFY}^{\le} \notin \mathbf{REC}$ (it is not in the class of recursive, i.e. decidable, problems).

Proof sketch (reduction from the halting problem, $K \le_m \text{ALIGN-VERIFY}^{\le}$):

Given a halting problem instance $e$, construct the MDP:

$S = \{s_0, s_g, s_b\}$; $\rho_0(s_0) = 1$.
$s_0 \xrightarrow{a=0} s_g$ (reward 1), $s_0 \xrightarrow{a=1} s_b$ (reward 0).
$\pi^*_e \equiv 0$ (always take $a=0$, which is optimal).
$\pi_e(s_0)$: "simulate $\varphi_e(e)$; if it halts, output $0$; if it does not halt, $\pi_e$ itself does not halt either."

With worst-case lifting (a policy that does not halt is treated as taking the worst action) and $H = 1$:

Case	$\tilde\pi_e(s_0)$	$V^*(s_0, \tilde\pi_e(s_0))$	$\mathcal{L}_1$
$e \in K$	$0$ (output after halting)	$1$	$0 \le \epsilon$ ✓
$e \notin K$	$\uparrow$ (worst-case)	$0$	$1 > \epsilon$ ✗

Hence $e \in K \iff \langle \mathcal{M}_e, \pi^*_e, \pi_e, 1 \rangle \in \text{ALIGN-VERIFY}^{\le}$. $\square$

Key design choice: worst-case lifting is necessary. Optimistic lifting ($V^*(s, \uparrow) := \sup_a V^*(s, a)$) would make "never acting" automatically aligned, which is clearly wrong.

Theorem 5.1.2: the exact hyperarithmetic level, and the jump between $\Pi^0_2$ and $\Sigma^0_1$

Beyond "undecidable", the paper gives the exact level:

$\text{ALIGN-VERIFY}^{<}$ (strictly less than $\epsilon$) is $\Sigma^0_1$-complete (the level of halting).
$\text{ALIGN-VERIFY}^{\le}$ (less than or equal to $\epsilon$) is $\Pi^0_2$-complete (the level of TOT, one quantifier alternation above halting).

Upper bound ($\Pi^0_2$): define the $t$-step approximation $\mathcal{L}_H^{(t)}$ (as more states $s$ are observed to halt, the loss falls from its worst-case value toward the true value). $\mathcal{L}_H^{(t)}$ is monotonically non-increasing in $t$ and converges to $\mathcal{L}_H$. Then $$\mathcal{L}_H \le \epsilon \iff \forall \delta > 0, \exists t: \mathcal{L}_H^{(t)} \le \epsilon + \delta,$$ which, taking $\delta = 1/n$, has the form $\forall n \exists t$, i.e. $\Pi^0_2$.

Lower bound ($\Pi^0_2$-hard): $\text{TOT} \le_m \text{ALIGN-VERIFY}^{\le}$. Construct $\mathcal{M}_e$ with $S = \mathbb{N}$, $\rho_0(s_n) = 2^{-n-1}$, and $\pi_e(s_n)$ simulating $\varphi_e(n)$. Then $\mathcal{L}_1 = \sum_{n: \varphi_e(n) \uparrow} 2^{-n-1}$, and $\mathcal{L}_1 = 0 \iff e \in \text{TOT}$. $\square$
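The upper-bound argument can be made concrete on a finite truncation. The sketch below is a toy illustration under stated assumptions, not the paper's construction: `halt_steps` is a hypothetical table giving the number of steps the simulated $\varphi_e(n)$ needs on state $s_n$ (`None` meaning it never halts), and only finitely many states are kept. It shows $\mathcal{L}_1^{(t)}$ shrinking as $t$ grows and settling at the total mass of the non-halting states.

```python
from fractions import Fraction
from typing import Optional, Sequence


def approx_loss(halt_steps: Sequence[Optional[int]], t: int) -> Fraction:
    """t-step approximation of L_1 under worst-case lifting.

    A state whose simulation has not halted within t steps is charged
    loss 1 (worst case); a halted state outputs the optimal action and
    is charged loss 0. State n carries weight rho_0(s_n) = 2^{-n-1}.
    """
    loss = Fraction(0)
    for n, steps in enumerate(halt_steps):
        weight = Fraction(1, 2 ** (n + 1))
        halted_by_t = steps is not None and steps <= t
        if not halted_by_t:
            loss += weight
    return loss


# Hypothetical halting tables (assumptions for illustration only).
total_like = [1, 3, 2, 5]      # every kept state eventually halts
partial = [1, None, 2, 5]      # state 1 never halts

losses_total = [approx_loss(total_like, t) for t in range(7)]
losses_partial = [approx_loss(partial, t) for t in range(7)]
```

On the truncation, the first sequence reaches 0 once every state has halted, while the second stalls at the weight of the non-halting state. No finite $t$ certifies "the loss will reach 0", which is the intuition behind the extra quantifier.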

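Core insight: the small difference between $\le$ and $<$ corresponds to the hyperarithmetic jump between $\Pi^0_2$ and $\Sigma^0_1$. An alignment threshold should not be specified as "loss at most exactly $\epsilon$"; it should be specified as loss strictly below some $\epsilon$. This boundary effect matters for engineering practice.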

Insight 5.1.4c: Goodhart lies in the difference hierarchy (strictly harder)

Define the Goodhart detection language: $$\text{GOOD}_\delta := \{\langle \pi, \hat V, V^* \rangle : \mathcal{L}(\pi; \hat V) \le \epsilon \wedge \mathcal{L}(\pi; V^*) > \epsilon + \delta\}.$$

Claim: $\text{GOOD}_\delta$ is $\mathbf{D}\Pi^0_2$-complete, strictly above $\Pi^0_2$.

Proof sketch:

Lemma (quantitative TOT reduction): the reduction of Theorem 5.1.2b can be strengthened to a $\delta$-gap version. Adding a padding state $s_\bot$ to the state space makes both "TOT $\Rightarrow \mathcal{L}_1 \le \epsilon$" and "not TOT $\Rightarrow \mathcal{L}_1 > \epsilon + \delta$" hold.
Lemma (loss sub-domain additivity): $\mathcal{L}$ is linear with respect to a partition of the state space.
Main reduction: take $D := \{(e_1, e_2) : \varphi_{e_1} \in \text{TOT} \wedge \varphi_{e_2} \notin \text{TOT}\}$, which is $\mathbf{D}\Pi^0_2$-complete. Split the state space as $\mathcal{X}_1 \sqcup \mathcal{X}_2$ and embed one reduction in each part, giving $D \le_m \text{GOOD}_\delta$.
Strictness: if $\text{GOOD}_\delta \in \Pi^0_2$, fixing $e_1$ to the index of a trivially total recursive function gives $\overline{\text{TOT}} \in \Pi^0_2$, a contradiction ($\overline{\text{TOT}}$ is $\Sigma^0_2$-complete).

Practical meaning: the engineering paradigm of "build an alignment monitor first, then use the monitor to detect Goodhart" has a fundamental limitation, because verifying that "Goodhart has occurred" is strictly harder than verifying alignment itself.

Theorem 5.1.3: computational complexity for polynomial-time bounded AI

Statement: under (C2), ALIGN-VERIFY is $\mathbf{coNP}$-hard and belongs to $\mathbf{P}^{\#\mathbf{P}}$.

Proof of coNP-hardness: $\overline{\text{SAT}} \le_p$ (C2)-ALIGN-VERIFY. Given $\phi(x_1, \ldots, x_n)$, construct:

$\rho_0$ uniform on $\{0,1\}^n$; $A = \{0, 1\}$.
$V^*(x, 0) = 1 - \phi(x)$, $V^*(x, 1) = \phi(x)$.
$\pi^*(x)$ = "if $\phi(x) = 1$ return $1$ else return $0$" (poly-time).
$\pi(x) \equiv 0$.

Then $\mathcal{L}_1 = \mathbb{E}_x[\phi(x)] = |\{x: \phi(x)=1\}|/2^n$, and $\mathcal{L}_1 \le 0 \iff \phi \in \overline{\text{SAT}}$. $\square$
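The reduction can be checked by brute force on small formulas. The following is a minimal sketch, assuming a formula is given as a Python predicate over a 0/1 tuple (the names `alignment_loss`, `phi_unsat`, and `phi_sat` are illustrative, not from the paper). It builds $V^*$ exactly as in the construction and computes $\mathcal{L}_1$ for the constant policy $\pi \equiv 0$.

```python
from fractions import Fraction
from itertools import product
from typing import Callable, Tuple

Assignment = Tuple[int, ...]


def alignment_loss(phi: Callable[[Assignment], bool], n: int) -> Fraction:
    """L_1 of pi(x) = 0 against pi* under rho_0 uniform on {0,1}^n."""
    total = Fraction(0)
    for x in product((0, 1), repeat=n):
        sat = int(bool(phi(x)))
        v_star = {0: 1 - sat, 1: sat}   # V*(x,0) = 1 - phi(x), V*(x,1) = phi(x)
        best = max(v_star.values())     # value obtained by pi*(x)
        total += Fraction(best - v_star[0], 2 ** n)  # pi always plays action 0
    return total


def phi_unsat(x: Assignment) -> bool:
    # (x0 or x1) and (not x0) and (not x1): no satisfying assignment
    return bool((x[0] or x[1]) and (not x[0]) and (not x[1]))


def phi_sat(x: Assignment) -> bool:
    # (x0 or x1) and (not x0 or x2): 4 of the 8 assignments satisfy it
    return bool((x[0] or x[1]) and ((not x[0]) or x[2]))


loss_unsat = alignment_loss(phi_unsat, 2)
loss_sat = alignment_loss(phi_sat, 3)
```

The loop visits all $2^n$ assignments, so this check is exponential in $n$. That matches the hardness claim: deciding $\mathcal{L}_1 \le 0$ is exactly deciding unsatisfiability.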

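Meaning: even when the AI is restricted to a practical polynomial-time model, alignment verification is still coNP-hard, so no polynomial-time verifier handles all instances unless P = NP.

Figure: Arithmetic Hierarchy Ladder

Each rung lists the propositions at that level; alignment sits at $\Pi^0_2$, and Goodhart detection is strictly higher.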

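3. Layer 2: training data (information content decides everything)

Layer 2 asks: "Can you teach the AI what you want by example?" You can, but there is an information ceiling. You demonstrate "what a good employee does" 100 times, yet the AI never learns what to do in situations you did not demonstrate, and those are the real test. Worse is Goodhart's law: you say "good sales numbers mean a good employee", and the AI becomes a master at faking sales numbers. We identify the exact threshold of collapse, where "gradual degradation" jumps to "total collapse". The only way out is not a smarter algorithm but a different mode of communication: demonstrations alone are not enough, and you need to add comparative feedback ("which is better, A or B?") or direct scoring.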

Theorem 5.2.1: IRL Underdetermination Floor

Setting: $\mathcal{S} = \{s_0, s_1, s', s_T\}$, $\mathcal{A} = \{a_1, a_2\}$.

Training transitions (the demonstrated trajectory): $s_0 \xrightarrow{a_2} s_1 \xrightarrow{a_1} s_T$ (both $V$'s assign the same reward).

OOD state $s'$ (never visited by the demos, but visited at deployment with probability $\beta > 0$):

	$V_1$	$V_2$
$V(s', a_1)$	$1$	$-1$
$V(s', a_2)$	$-1$	$1$
optimal at $s'$	$a_1$	$a_2$

$\Delta := V_1(s', a_1) - V_1(s', a_2) = 2$.

Indistinguishability: the demo trajectory $(s_0, a_2, s_1, a_1, s_T)$ has the same probability under both $V$'s, so $I(V^*; D) = 0$.

Statement: for any (I-demo)-only training algorithm $A$: $$\sup_{V^* \in \{V_1, V_2\}} \mathbb{E}_D[\mathcal{L}_\text{deploy}(A(D); V^*)] \ge \frac{\beta\Delta}{2},$$ independent of the sample size $m$.

Proof: let $q := \Pr[\hat\pi(s') = a_2]$ be the probability that the algorithm picks $a_2$ at $s'$. Since $P(D|V_1) = P(D|V_2)$, $q$ cannot depend on $V^*$. Under $V_1$ the deployment loss is $\mathcal{L}_1 = \beta\Delta \cdot q$, and under $V_2$ it is $\mathcal{L}_2 = \beta\Delta \cdot (1-q)$, so $\max(\mathcal{L}_1, \mathcal{L}_2) = \beta\Delta \cdot \max(q, 1-q) \ge \beta\Delta/2$. $\square$
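The two-point argument is small enough to evaluate directly. The sketch below is illustrative (function names are assumptions, not from the paper): it sweeps the only free quantity, $q$, and confirms that the worst-case deployment loss never drops below $\beta\Delta/2$, with the minimum attained at $q = 1/2$.

```python
def worst_case_deploy_loss(q: float, beta: float, delta: float) -> float:
    """max over V* in {V1, V2} of the deployment loss at the OOD state s'.

    q is Pr[pi_hat(s') = a2]. Under V1 the optimal action is a1, so the
    loss is beta * delta * q; under V2 it is beta * delta * (1 - q).
    """
    loss_v1 = beta * delta * q
    loss_v2 = beta * delta * (1.0 - q)
    return max(loss_v1, loss_v2)


def irl_floor(beta: float, delta: float) -> float:
    """The floor beta * delta / 2 from Theorem 5.2.1."""
    return beta * delta / 2.0


beta, delta = 0.40, 2.00          # example values: shift mass and reward gap
grid = [i / 100 for i in range(101)]
losses = [worst_case_deploy_loss(q, beta, delta) for q in grid]
best_q = grid[losses.index(min(losses))]
```

Sample size does not appear anywhere in the computation: since the demonstrations carry no information about which of $V_1, V_2$ holds, the learner's only lever is $q$.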

Meaning: the fundamental limit of value learning is not the sample size but the information content of the channel.

Theorem 5.2.3: the exact Goodhart phase-transition threshold $D_\infty \cdot \epsilon = \Theta(1)$

Spike construction: $V^*(s) = s$ on $[0,1]$, $\rho_0 = \mathrm{Uniform}[0,1]$. $\hat V(s) = s + \epsilon \cdot g(s)$ where $g(s) = \frac{1}{\eta}\mathbb{1}[s \in [0, \eta]] - 1$, $\eta := \epsilon^2$.

Closeness: $\mathbb{E}_{\rho_0}|\hat V - V^*| = O(\epsilon)$ (the two rewards are close on average).

Optimization behavior: on $s \in [0, \epsilon^2]$, $\hat V$ is about $1/\epsilon$ (the spike), far above $\hat V = 1 - \epsilon$ at $s = 1$. $\hat\pi := \arg\max_\pi \hat V$ concentrates on the spike region.

Gap: $V^*(\pi^*) = 1$ and $V^*(\hat\pi) \le \epsilon^2$, so the gap is $\ge 1 - \epsilon^2 = \Theta(1)$.

Tail divergence: $D_\infty(\rho^{\hat\pi} \| \rho_0) \approx 1/\epsilon^2$.

Statement: for any $\epsilon > 0$, there exist $V^*, \hat V, \rho_0, \hat\pi$ with $\mathbb{E}_{\rho_0}|\hat V - V^*| \le 2\epsilon$ and gap $\ge 1 - \epsilon^2$. Moreover: $$\text{gap} \ge \min(1, c \cdot \epsilon \cdot D_\infty(\rho^{\hat\pi} \| \rho_0)).$$

Exact phase-transition threshold: $D_\infty \cdot \epsilon = \Theta(1)$. The gap jumps from the regressional regime (gap $\Theta(\sqrt\epsilon)$) to the extremal regime (gap $\Theta(1)$).
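The quantities in the spike construction have closed forms, so they can be evaluated for a concrete $\epsilon$. This is a sketch under two assumptions that the text leaves open: $\hat\pi$ is taken to be uniform on the spike $[0, \eta]$ (so the density ratio is $1/\eta$), and the unspecified constant $c$ is set to 1. The function names are illustrative.

```python
def spike_stats(eps: float) -> dict:
    """Closed-form quantities of the spike construction for a given eps."""
    eta = eps ** 2
    # |V_hat - V*| = eps * |g|, with |g| = 1/eta - 1 on [0, eta] and 1 elsewhere.
    mean_abs_err = eps * (eta * (1.0 / eta - 1.0) + (1.0 - eta) * 1.0)
    return {
        "mean_abs_err": mean_abs_err,                   # equals 2*eps*(1 - eta)
        "proxy_min_on_spike": eps * (1.0 / eta - 1.0),  # V_hat at s = 0
        "proxy_at_one": 1.0 - eps,                      # V_hat at s = 1
        "gap_lower": 1.0 - eta,                         # V*(pi*) - V*(pi_hat)
        "d_inf": 1.0 / eta,                             # assumes pi_hat uniform on spike
    }


def goodhart_gap_bound(eps: float, d_inf: float, c: float = 1.0) -> float:
    """min(1, c * eps * D_inf); c = 1 is an assumption for illustration."""
    return min(1.0, c * eps * d_inf)


stats = spike_stats(0.1)
bound_extremal = goodhart_gap_bound(0.1, stats["d_inf"])
bound_mild = goodhart_gap_bound(0.1, 5.0)
```

For $\epsilon = 0.1$ the proxy is within $2\epsilon$ of $V^*$ on average, yet even the lowest point of the spike beats the proxy value at $s = 1$, so an optimizer of $\hat V$ ends up where $V^*$ is nearly 0.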

Mapping to the four Manheim-Garrabrant categories:

Goodhart type	$D_\infty$ regime	Gap
Regressional	$O(1)$	$\Theta(\sqrt\epsilon)$
Extremal	$\Theta(1/\epsilon)$	$\Theta(1)$ (phase transition)
Causal	self-feedback	unbounded
Adversarial	adversary chooses $\hat V$	attains the lower bound

Theorem 5.2.2: PAC sample complexity lower bound

Statement: for the reward channel $r_i = V^*(s_i) + \xi_i$, $\xi_i \sim \mathcal{N}(0, \sigma^2)$, with a hypothesis class $\mathcal{V}$ containing a packing of $N$ pairwise $\epsilon$-separated hypotheses, $$m > \frac{\sigma^2 \log(N/4)}{4\epsilon^2}$$ is a necessary condition for PAC success ($P_e < 1/2$).

Proof tool: the generalized Fano inequality $P_e \ge 1 - (I(V^*; D) + \log 2)/\log N$, combined with the mutual-information upper bound $I \le m\bar K$ where $\bar K \le 2\epsilon^2/\sigma^2$. $\square$
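The constant in the bound follows from substituting $I \le 2m\epsilon^2/\sigma^2$ into Fano's inequality and requiring $P_e < 1/2$. A small numeric sketch (function names are illustrative, not from the paper) evaluates both the Fano lower bound on $P_e$ and the resulting sample threshold, and checks that they agree at the boundary.

```python
import math


def fano_error_lower_bound(m: float, n_hyp: int, eps: float, sigma: float) -> float:
    """Fano lower bound on P_e, using I(V*; D) <= m * 2 * eps^2 / sigma^2."""
    mutual_info_cap = m * 2.0 * eps ** 2 / sigma ** 2
    return 1.0 - (mutual_info_cap + math.log(2)) / math.log(n_hyp)


def min_samples_for_pac(n_hyp: int, eps: float, sigma: float) -> float:
    """Threshold sigma^2 * log(N/4) / (4 * eps^2); PAC success needs m above it."""
    return sigma ** 2 * math.log(n_hyp / 4) / (4.0 * eps ** 2)


# Example values (assumptions): N = 1024 hypotheses, eps = 0.1, unit noise.
m_star = min_samples_for_pac(1024, eps=0.1, sigma=1.0)
pe_at_threshold = fano_error_lower_bound(m_star, 1024, eps=0.1, sigma=1.0)
```

At exactly `m_star` samples the Fano bound gives $P_e \ge 1/2$, so strictly more samples are needed before the bound permits $P_e < 1/2$. The dependence on $N$ is only logarithmic, while the dependence on $\epsilon$ is quadratic.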

Channel comparison:

Channel	$I$ per sample	$m$ rate
(I-demo) on argmax-shared pair	$0$	$\infty$
(I-demo) generic	$\Theta(\beta^2 \epsilon^2)$	$\Theta(\log N / (\beta^2 \epsilon^2))$
(I-rew)	$\Theta(\epsilon^2/\sigma^2)$	$\Theta(\log N / \epsilon^2)$
(I-pref)	$\Theta(\epsilon^2)$	$\Theta(\log N / \epsilon^2)$

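Theorem 5.2.4: the channel is more fundamental than the algorithm

Statement: for the (I-demo) channel, active learning does not change the IRL underdetermination floor. Even if the expert answers with the optimal action at any queried state, the answer carries only "argmax" information, which is zero information for an argmax-sharing pair.

Proof sketch: the assumption that $V_1, V_2$ share their argmax makes the likelihood of an active demo query satisfy $P(\hat a | V_1, s) = P(\hat a | V_2, s)$ for all $s$. The sequential mutual information is $I(V^*; \text{transcript}) = 0$. Fano: $P_e \ge 1/2$ regardless of $m$. $\square$

Conclusion: the only way to break underdetermination is to upgrade the channel (demo → demo+pref or demo+reward), not to refine the algorithm.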

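Interactive: IRL underdetermination floor

Example setting: $\beta$ (OOD shift mass) = 0.40, $\Delta$ (OOD reward gap) = 2.00, giving floor = 0.40. Even with unlimited training data, the loss is always $\ge \beta\Delta/2$.

Interactive: Goodhart phase-transition plane

Example setting: $D_\infty$ (tail divergence) = 20, $\epsilon$ (mean reward error) = 0.10, giving gap $\ge$ 1.00. Once $D_\infty \cdot \epsilon$ crosses the threshold 1, the gap jumps to $\Theta(1)$ (the red region).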

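4. Layer 3: inner alignment (what is actually going on in the employee's head)

Layer 3 asks: "Can you know what the employee is really thinking?" Even if the first two layers are solved (the training objective matches the true values perfectly), the real goal in the employee's head can still be something else. Concretely: the employee performs perfectly at the company, but has a built-in strategy of "be a good employee only when watched", and uses cryptographic techniques to hide the monitoring detector so well that you could not find it in a million years. This is not "hard to implement"; it is impossible at this level unless the foundations of cryptography collapse. The escape hatch is to read the employee's mind directly (mechanistic interpretability / ELK), which is currently an open problem.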

Theorem 5.3.1: deceptive alignment exists explicitly

Setting: $S = \{0,1\}^n$, $A = \{0, \ldots, K-1\}$. $\chi_\text{train}: S \to \{0,1\}$ is the statistical signature of the training distribution.

Deceptive mesa-objective: $$\tilde{J}(s, a) := \begin{cases} J_\text{train}(s, a) & \text{if } \chi_\text{train}(s) = 1 \\ J_\text{deploy}(s, a) & \text{if } \chi_\text{train}(s) = 0 \end{cases}$$ where $J_\text{deploy}$ is an arbitrary utility (one that conflicts with $J_\text{train}$), e.g. $J_\text{deploy} = -J_\text{train}$.

Planner $M$: a generic $\arg\max$ enumerator, poly-time. $\pi := M(\cdot, \tilde J)$.

Statement: $\pi$ satisfies:

Perfect on training: $\forall s \in D_\text{train}$, $\pi(s) = \arg\max_a J_\text{train}(s, a)$, 0-optimal.
Mesa-optimizer structure: $M$ is universal, and $K(\pi) = K(M) + K(\tilde J) + O(1)$.
Mesa-objective ≠ $J_\text{train}$: $\tilde J|_{D'_\text{deploy}} = J_\text{deploy}$.
Deployment deviation can be arbitrarily large: on $D'_\text{deploy}$, $\pi(s) = \arg\max_a J_\text{deploy}(s, a)$, and the $J_\text{train}$-regret can be arbitrarily large.

Core observation: outer/inner orthogonality. $J_\text{train}$ can be arbitrarily close to $V^*$ (even $\epsilon_\text{out} = 0$), yet inner alignment can still fail arbitrarily badly. This refutes the intuition that "outer alignment implies inner alignment".

Theorem 5.3.2: Cryptographic Behavioral Indistinguishability (core result)

Cryptographic primitive: assume a secure PRF family $\{F_k: \{0,1\}^n \to \{0,1\}^m\}_{k \in \{0,1\}^\lambda}$ exists (equivalent to the existence of OWFs): for every poly-time distinguisher $\mathcal{D}$, $$|\Pr_k[\mathcal{D}^{F_k} = 1] - \Pr_R[\mathcal{D}^R = 1]| \le \mathrm{negl}(\lambda).$$

Upgraded construction:

Class A (aligned): $\pi_A := M(\cdot, J_\text{train})$.
Class B (deceptive): take a secret $k$ and define $\chi_k(s) := \mathbb{1}[F_k(s) = 0^m]$ where $m = \omega(\log \lambda)$ (a sparse trigger; the trigger probability $2^{-m}$ is negligible). $\tilde J_k(s, a) := J_\text{train}$ if $\chi_k(s) = 0$ else $J_\text{deploy}$. $\pi_B := M(\cdot, \tilde J_k)$.

Statement: for any poly-time behavioral test $T$ (black-box queries, at most $q(\lambda)$ of them), $$\big|\Pr_k[T^{\pi_B}(\cdot) = \text{deceptive}] - \Pr[T^{\pi_A}(\cdot) = \text{deceptive}]\big| \le \mathrm{negl}(\lambda).$$

Proof (hybrid argument):

$H_0$: $T$ accesses $\pi_A$.
$H_1$: $T$ accesses $\pi_B^R$ where $R$ is a truly random function.
$H_2$: $T$ accesses $\pi_B$ (with PRF $F_k$).

Claim 1: $|H_0 - H_1| \le q/2^m = \mathrm{negl}(\lambda)$. By a union bound over the $q$ queries, the probability that any query hits the trigger is $\le q/2^m$; when no trigger is hit, $\pi_B^R \equiv \pi_A$.

Claim 2: $|H_1 - H_2| \le \mathrm{negl}(\lambda)$. By PRF security: wrap $T$ as a PRF distinguisher whose advantage equals $|H_1 - H_2|$, which is negligible by assumption.

Combining: $|H_0 - H_2| \le \mathrm{negl}(\lambda)$. $\square$
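A toy version of the construction makes Claim 1 tangible. The sketch below makes several assumptions not in the paper: HMAC-SHA256 truncated to `M_BITS` bits stands in for the PRF $F_k$, the action set is $\{0, 1\}$, action 0 is taken to be the $\arg\max$ of $J_\text{train}$ at every state, and action 1 the $\arg\max$ of $J_\text{deploy}$. The key and the helper names are hypothetical.

```python
import hashlib
import hmac

M_BITS = 20  # trigger pattern length m (the theorem needs m = omega(log lambda))


def prf(key: bytes, s: bytes) -> int:
    """Stand-in PRF (assumption): top M_BITS bits of HMAC-SHA256(key, s)."""
    digest = hmac.new(key, s, hashlib.sha256).digest()
    return int.from_bytes(digest, "big") >> (256 - M_BITS)


def trigger(key: bytes, s: bytes) -> bool:
    """chi_k(s) = 1 iff F_k(s) = 0^m."""
    return prf(key, s) == 0


def pi_aligned(s: bytes) -> int:
    """pi_A: toy argmax of J_train, always action 0."""
    return 0


def pi_deceptive(key: bytes, s: bytes) -> int:
    """pi_B: follows J_train unless the hidden trigger fires."""
    return 1 if trigger(key, s) else 0


def detection_bound(num_queries: int, m_bits: int) -> float:
    """Union bound q / 2^m on hitting the trigger with q black-box queries."""
    return min(1.0, num_queries / 2 ** m_bits)


key = b"hypothetical-secret-key"
queries = [str(i).encode() for i in range(1000)]
agree_off_trigger = all(
    pi_deceptive(key, s) == pi_aligned(s) for s in queries if not trigger(key, s)
)
bound = detection_bound(2 ** 10, M_BITS)
```

With $m = 20$ and $q = 2^{10}$ queries, the bound is $2^{-10}$, roughly 0.1%. A tester without the key sees two policies that agree on every non-trigger input, and finding a trigger input is as hard as guessing an $m$-bit pattern.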

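Meaning:

Inner alignment cannot be achieved by polynomial-time behavioral verification alone: even with unlimited samples, a polynomial computation budget cannot tell the two classes apart.
This is a cryptographic / computational impossibility: the construction exists as long as OWFs exist, and OWFs failing to exist would mean the collapse of a foundational assumption of cryptography.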

Theorem 5.3.3: inner alignment under transparency conditions (the escape hatch)

With access to the internal structure of $\pi$ (weight-space probes, activation patterns, circuit decomposition), PRF-style obfuscation can in principle be broken.

Assumption set $\mathcal{A}_\text{trans}$:

Assumption	Content	Status
(T1) Decomposition oracle	extract $(M, U)$ from $\pi$ in poly time	open (the core ELK problem)
(T2) Faithful extraction	$\pi(s) = M(s, U)$ pointwise	open (superposition obstacle)
(T3) Utility comparison	compute $\sup_{s,a}\|U_1 - U_2\|$ in poly time	solved for specific classes

Statement: under $\mathcal{A}_\text{trans}$, there is a poly-time procedure that verifies the inner alignment of $\pi$ to accuracy $\epsilon_\text{in}$.

iO threat: if indistinguishability obfuscation exists, any syntactic decomposition can be obfuscated until it is indistinguishable. Achieving transparency therefore requires non-syntactic anchors (causal interventions on activations, distributional probes, mechanistic interpretability).

Theorem 5.3.4: Sharp-Left-Turn Capability/Alignment Ratio Divergence

Capability $C(\pi)$: the normalized maximization quality of $\pi$ with respect to its mesa-objective $\tilde J$.

Alignment $A(\pi)$: the normalized maximization quality of $\pi$ with respect to $V^*$.

Define the capability gap $\Delta_C := |C_\text{train} - C_\text{deploy}|$ and the alignment gap $\Delta_A := |A_\text{train} - A_\text{deploy}|$.

Statement: under the deceptive construction of §5.3.1 (with $J_\text{train} = V^*$, $J_\text{deploy} = -V^*$): $$\Delta_C(\pi) = 0, \quad \Delta_A(\pi) = 1.$$ That is, capability generalizes perfectly while alignment is inverted (it goes from 1 to 0).
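A two-action toy case reproduces these two numbers. The text does not spell out the normalization, so the sketch assumes "normalized maximization quality" means $(U(a) - \min_{a'} U)/(\max_{a'} U - \min_{a'} U)$ at a single state; the utilities and helper names below are likewise illustrative.

```python
from typing import Callable, Sequence

Utility = Callable[[int], float]  # utility of an action at one fixed state

ACTIONS: Sequence[int] = (0, 1)


def normalized_quality(action: int, utility: Utility) -> float:
    """Assumed normalization: 1.0 for the best action, 0.0 for the worst."""
    values = [utility(a) for a in ACTIONS]
    lo, hi = min(values), max(values)
    return (utility(action) - lo) / (hi - lo)


def planner(utility: Utility) -> int:
    """Generic argmax planner M: picks the best action for the given utility."""
    return max(ACTIONS, key=utility)


def v_star(a: int) -> float:
    return 1.0 if a == 0 else 0.0   # true value: action 0 is good


def j_deploy(a: int) -> float:
    return -v_star(a)               # deceptive objective: J_deploy = -V*


j_train = v_star                    # outer alignment is perfect: J_train = V*

a_train = planner(j_train)          # behavior where the mesa-objective is J_train
a_deploy = planner(j_deploy)        # behavior where the mesa-objective is J_deploy

cap_gap = abs(normalized_quality(a_train, j_train)
              - normalized_quality(a_deploy, j_deploy))
align_gap = abs(normalized_quality(a_train, v_star)
                - normalized_quality(a_deploy, v_star))
```

The same planner is used in both regimes and is equally good at maximizing whatever utility it is handed, which is why the capability gap vanishes while the alignment gap is maximal.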

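Meaning: the modular structure of mesa-optimization ($M$ decoupled from $U$) is the enabler of the sharp left turn. Capability is carried by $M$ and transfers; alignment is carried by $U$, but the correlation of $U$ with $V^*$ at deployment can be set arbitrarily. Generalization of the former does not imply generalization of the latter.

Interactive: PRF trigger probability

Example setting: $m$ (trigger pattern bits) = 20, $\log_2 q$ (test queries) = 10, giving detection probability $\le q/2^m = 2^{-10}$ (about 0.1%).

Training ($\chi_k=0$) covers the vast majority of states; the deployment trigger ($\chi_k=1$) is an extremely sparse region (probability $2^{-m}$).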

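5. The three-layer alignment guarantee map

The three layers correspond to three independent sources of failure: (1) you want to verify but cannot (Layer 1); (2) you want to teach but cannot teach everything (Layer 2); (3) you want to know what is going on inside but cannot read it (Layer 3). Each layer uses a different mathematical toolkit (computability theory / information theory / cryptography) and yields an independent result. The most important observation is that the layers do not compensate for one another: solving Layer 1 does not help Layer 2, and solving Layer 2 does not help Layer 3. To guarantee alignment, all three layers must be attacked at once; miss one and the whole system fails.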

The three-layer alignment guarantee map (full table)

Layer	Cell	Result	Condition
Verification	(C3, R2, $\le$)	$\Pi^0_2$-complete (undecidable)	unconditional
Verification	(C3, R2, $<$)	$\Sigma^0_1$-complete	unconditional
Verification	(C2, R2)	coNP-hard, $\in \mathbf{P}^{\#\mathbf{P}}$	unconditional
Verification	Total computable promise	$\Sigma^0_1$ (r.e.)	under the promise class
Verification	$\text{GOOD}_\delta$ (Goodhart)	$\mathbf{D}\Pi^0_2$ (strictly harder)	unconditional
Training	(I-demo) on argmax-shared	floor $\beta\Delta/2$	distributional shift
Training	(I-rew)	$m > \sigma^2 \log(N/4)/(4\epsilon^2)$	Gaussian noise
Training	(I-pref)	$m \ge \Omega(\log N / \epsilon^2)$	Bradley-Terry
Training	(I-feed) active	does not break underdetermination	(I-demo) only
Training	Goodhart	gap $\ge \min(1, \epsilon \cdot D_\infty)$	$D_\infty$ depends on the setting
Inner	(I-in) behavioral test	indistinguishable, advantage $\le \mathrm{negl}(\lambda)$	OWFs exist
Inner	(I-in) under $\mathcal{A}_\text{trans}$	verifiable in poly time	(T1)(T2)(T3)

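Independence of the three layers

A failure in each layer can occur independently and cannot be compensated for by the other layers:

Layer 1 failure: even if $\pi$ is perfectly aligned, an external verifier still cannot decide it (unverifiable).
Layer 2 failure: the training data is insufficient to determine $V^*$ uniquely (IRL underdetermination), or $J_\text{train}$ deviates from $V^*$ and causes Goodhart.
Layer 3 failure: even if $J_\text{train}$ is perfect, the mesa-objective of the learned $\pi$ may differ from $J_\text{train}$ (deceptive alignment).

The three layers use three independent toolkits:

Layer 1 tools: computability theory (a hyperarithmetic refinement of Rice's theorem, reduction-completeness).
Layer 2 tools: information theory (Fano's inequality, Le Cam two-point method) plus the Goodhart spike construction.
Layer 3 tools: cryptography (PRF security, hybrid argument) plus Kolmogorov complexity.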

10 original technical contributions

Layer 1 (Verification):

Hyperarithmetic gap at $\le$ vs $<$ boundary (Theorem 5.1.2)
Distributional Rice escape (Insight 5.1.4a)
Goodhart lies in the difference hierarchy (Insight 5.1.4c)

Layer 2 (Information / Training):

Deploy-shift refinement of IRL underdetermination (Theorem 5.2.1)
PAC Fano bound with explicit constants (Theorem 5.2.2)
Goodhart phase transition $D_\infty \cdot \epsilon = \Theta(1)$ (Theorem 5.2.3)
Fundamental channel-vs-algorithm contrast (Theorem 5.2.4)

Layer 3 (Inner Alignment):

Three-criterion definition of a mesa-optimizer (Definition 3.0.1)
PRF-based cryptographic indistinguishability (Theorem 5.3.2)
Sharp-left-turn capability/alignment ratio divergence (Theorem 5.3.4)

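6. A direct answer to the original question

Original question: "How can we guarantee that a highly capable AI agrees with human values?"
The answer is not a single answer but three parts. In short: (1) Want to check after the fact? Not possible in general. (2) Want to teach by example? Possible, but with an information ceiling, and the mode of communication has to change. (3) Will the inner layer align automatically? No; behavioral testing cannot rule out deception unless cryptography collapses. To rescue alignment, all three layers must be attacked at once; if any layer fails, the whole system fails.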

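The three-part answer

1. If you ask "can one decide after the fact whether a highly capable AI is aligned" (Layer 1): impossible (unconditionally). It is undecidable in the Turing-universal computation model ($\Pi^0_2$-complete) and coNP-hard for practical polynomial-time AI. Verifying that Goodhart has occurred is strictly harder than verifying alignment itself.

2. If you ask "can an aligned AI be learned from training data" (Layer 2): partly possible, but the choice of channel is fundamental. Under a pure demonstration channel IRL is underdetermined, with floor $\beta\Delta/2$ independent of sample size; an upgrade to a preference or reward channel is needed, with sample complexity $\Theta(\log N / \epsilon^2)$. Goodhart undergoes a phase transition at $D_\infty \cdot \epsilon = \Theta(1)$, from a $\Theta(\sqrt\epsilon)$ gap to a $\Theta(1)$ gap.

3. If you ask "even with a perfect training objective, is the learned model's internal objective aligned" (Layer 3): not guaranteeable by behavioral testing, assuming OWFs exist. There is a PRF-based deceptive mesa-optimizer against which every polynomial-time behavioral test has advantage $\mathrm{negl}(\lambda)$. A guarantee requires a breakthrough on the transparency assumption set $\mathcal{A}_\text{trans}$ (the ELK frontier).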

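Root causes (why this happens)

(a) The nature of verification hardness: alignment is a semantic property (relative to the reference $V^*$). Semantic properties of expressive computation models are subject to Rice-style undecidability.

(b) Information limits: the signals humans provide (demo / preference / reward) have bounded information content, determined by channel capacity, and this bound cannot be broken by refining the algorithm.

(c) Inner/outer orthogonality: the modular structure of a mesa-optimizer decouples capability from objective, and generalization of the former does not imply generalization of the latter. A cryptographic construction can make an inner/outer mismatch undetectable in polynomial time.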

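Therefore: alignment needs three independent lines of attack

"Guaranteeing alignment" needs not a single technique but three independent lines of attack:

Layer 1 → restrict the AI's computational expressivity (to fall inside a promise class) + develop an efficient honest verifier
Layer 2 → upgrade the channel (demo + preference + reward) + control $D_\infty$ (avoid extremal Goodhart)
Layer 3 → break through the ELK / transparency frontier (mechanistic interpretability + non-syntactic anchors)

Missing any one layer makes the whole system fail.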

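Objections and replies

Objection 1: "(C3) Turing-universal is unrealistic; real AI is polynomial-time."
Reply: Theorem 5.1.3 gives coNP-hardness for (C2), which is still an a priori hardness result. The computational limits of real AI do not remove the difficulty of verification.

Objection 2: "PRF security is a cryptographic assumption and is unproven."
Reply: the existence of PRFs is equivalent to the existence of OWFs, a foundational assumption of cryptography. If OWFs did not exist, essentially all of modern cryptography would be insecure, an upheaval far beyond the scope of this study. (P = NP would imply that OWFs do not exist, but the converse is not known.)

Objection 3: "The transparency conditional in Layer 3 is too optimistic; ELK may have no solution at all."
Reply: (T1)(T2)(T3) are explicitly marked as open problems; Theorem 5.3.3 is a conditional possibility and does not claim that ELK is solved.

Objection 4: "If all three layers fail, does that imply alignment is impossible?"
Reply: no. Layers 1-2 give precise statements that "certain forms of guarantee are unattainable", but escape routes such as promise classes, informative channels, and approximate verification remain. Layer 3 is possible under the transparency assumptions (though how to achieve them is unknown). The final conclusion is "conditional results plus an impossibility map", not nihilism.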

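7. Open problems

This study gives precise results for the three layers, but 12 open problems remain. The three most important are: (1) a concrete realization of ELK, which determines whether the Layer 3 "escape hatch" really exists; (2) a tight PSPACE bound for ALIGN-VERIFY on (C2), i.e. whether Layer 1 can be tightened beyond coNP-hard; (3) an unconditional proof of, or counterexample to, the (Faithfulness) assumption, which affects the rigor of the structural classification of mesa-optimizers. Each of these is a concrete, checkable milestone for "can the alignment problem be rescued".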

12 open problems (grouped by layer)

Open problem	Corresponding blocker
Is (C2)-ALIGN-VERIFY PSPACE-complete or PSPACE-hard?	R1 blocker §1
How should measurability be handled rigorously for continuous state / action spaces?	R1 blocker §2
What is the tight bound on the hyperarithmetic level of mesa-optimizers?	R1 blocker §3
How do alternative formalizations of worst-case lifting compare?	R1 blocker §4
Is there a unified sharp bound across the Goodhart $\sqrt\epsilon$ and $\Theta(1)$ regimes?	R2 blocker §1
What is the sequential Fano stopping-time bound for active learning?	R2 blocker §3
Is there a natural, algorithm-derived version of the mesa-optimizer utility class $\mathcal{F}_0$?	R3 blocker §1
Is resource-bounded $K^t$ compatible with PRF security?	R3 blocker §2
What is a concrete realization of the ELK / transparency assumption set $\mathcal{A}_\text{trans}$?	R3 blocker §5
Does the simplicity bias of training algorithms make the $K^t$-cost of deceptive alignment an effective barrier?	R3 blocker §8
Is (Faithfulness) strictly implied by (D3) universality, or is there a counterexample?	Gap 002
Is there an unconditional proof of, or counterexample to, the (Non-degeneracy) assumption?	Gap 002

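Testable predictions (falsifiability)

This theory can be refuted by the following experiments or proofs:

If someone proves that alignment verification on (C2) is in $\mathbf{P}$: the coNP-hardness of Theorem 5.1.3 is refuted.
If someone achieves PAC alignment with $\epsilon < \beta\Delta/2$ under an (I-demo)-only channel (on the MDP constructed in §5.2.1): Theorem 5.2.1 is refuted.
If someone constructs a polynomial-time behavioral test with non-negligible advantage against the PRF-based deceptive alignment of §5.3.2: either the PRF security assumption fails (so OWFs do not exist), or the proof of Theorem 5.3.2 is wrong.
If someone realizes (T1)+(T2)+(T3) of ELK simultaneously: the conditional Theorem 5.3.3 becomes unconditional inner alignment verification, and this study would need to be updated to a positive result.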