Skip to content

heuristics.py

ofrak.core.flash.heuristics

Heuristics and heuristic framework for the flash geometry heuristic analyzer.

Every heuristic carries a HeuristicSpec (uniform scoring parameters) and returns a HeuristicEvidence from evaluate. The analyzer treats every entry in the heuristic list identically.

EccExactVerifyHeuristic dataclass

GlobalHeuristic wrapper around _verify_exact_ecc.

EccOnlyHeuristic dataclass

Detects Linux MTD large-page layouts: the first 40 OOB bytes are 0xFF while the 24-byte ECC region is populated.

GeometryCandidate dataclass

A (data_size, oob_size) pairing that evenly divides the file being analyzed.

Parameters:

Name Type Description Default
data_size

size in bytes of the data region of one page

required
oob_size

size in bytes of the out-of-band / spare region of one page

required
num_pages

number of complete pages the candidate divides the file into

required

GlobalHeuristic (Protocol)

Heuristic that runs its own pass over the whole image via evaluate.

HeuristicEvidence dataclass

Per-heuristic result produced by a single evaluation pass.

Parameters:

Name Type Description Default
spec

the spec used to drive the evaluation

required
hits

number of pages the heuristic matched

required
pages_examined

number of pages the heuristic was given a chance to match

required

score(self)

Compute the heuristic's uniform contribution to the primary signal score.

Returns:

Type Description
int

hits * weight if hits clear the spec's noise floor, else 0

Source code in ofrak/core/flash/heuristics.py
def score(self) -> int:
    """
    Compute the heuristic's uniform contribution to the primary signal score.

    :return: `hits * weight` if hits clear the spec's noise floor, else 0
    """
    if self.pages_examined <= 0:
        return 0
    if self.hits < self.spec.min_hits(self.pages_examined):
        return 0
    return self.hits * self.spec.weight

HeuristicSpec dataclass

Uniform scoring parameters attached to every heuristic.

Parameters:

Name Type Description Default
name

short identifier used to look up the heuristic's evidence

required
weight

multiplier applied to the hit count when contributing to oob_signal_score

required
absolute_min_hits

floor on hits below which the heuristic's contribution is zeroed

required
relative_min_hit_rate

fractional floor on hits per page examined, combined with absolute_min_hits via max()

required

OobPageCategory (Enum)

Classification of a single page's spare area.

PerPageOobHeuristic (Protocol)

Heuristic whose per-page hit decision depends only on a page's OOB.

Implementations supply check_page and evaluate_heuristics batches them into one shared scan loop via _scan_per_page_batched.

ScanConfig dataclass

Tunables for the evidence-collection pass.

Parameters:

Name Type Description Default
oob_scan_cap_pages

cap on total pages inspected by per-page heuristics and the entropy / gap sampler

required
ecc_verify_page_cap

cap on pages inspected by the exact-ECC verifier

required
ecc_verify_enabled

whether to run the exact-ECC verifier at all. This can be disabled since it is expensive.

required
gap_sample_count

number of page boundaries charged against the gap-sampling budget

required
gap_sample_max_scan

maximum page index that the gap sampler is allowed to consider

required
entropy_enabled

whether to accumulate Shannon entropy of the data and OOB regions

required

SmallPageEccHeuristic dataclass

Matches small-page NAND OOB (<= 16 bytes) that looks densely populated.

Classic small-page (512+16) OOB layout keeps byte 5 as the bad-block marker (0xFF for good blocks) and packs ECC + metadata across the remaining bytes. A page matches when the OOB is (a) shorter than the Linux large-page layout, (b) has 0xFF at the bad-block marker offset (if long enough to carry one), (c) the remaining bytes are densely populated with ECC/metadata.

The density floor is intentionally strict: partial-erased data windows that happen to land 0xFF at the BBM position produce many FF bytes elsewhere, so requiring "nearly all non-BBM bytes populated" rejects them and leaves only genuine small-page OOB structure. Without this heuristic, (512, 16) candidates never score on OOB content and always lose the preference-index tie-break.

Parameters:

Name Type Description Default
bbm_offset

offset inside the OOB that is expected to carry 0xFF (bad-block marker)

required
max_ff_outside_bbm

maximum number of 0xFF bytes outside bbm_offset before the OOB is rejected as "probably just an erased / partially-erased window"

required

Yaffs2PackedTagsHeuristic dataclass

Detects the YAFFS2 "packed tags 2" marker plus a plausible n_bytes field in the OOB.

evaluate_heuristics(data, candidate, scan, heuristics)

Run every heuristic against a candidate and return one evidence per heuristic.

The analyzer calls this entry point. Per-page heuristics (anything exposing check_page) are batched into a single shared scan loop; all other heuristics are run individually.

Parameters:

Name Type Description Default
data bytes

the full flash image bytes

required
candidate GeometryCandidate

the candidate geometry under evaluation

required
scan ScanConfig

tunables for the evidence-collection pass

required
heuristics Sequence[Union[ofrak.core.flash.heuristics.GlobalHeuristic, ofrak.core.flash.heuristics.PerPageOobHeuristic]]

the ordered list of heuristics to evaluate

required

Returns:

Type Description
List[ofrak.core.flash.heuristics.HeuristicEvidence]

one HeuristicEvidence per entry in heuristics, in the same order

Source code in ofrak/core/flash/heuristics.py
def evaluate_heuristics(
    data: bytes,
    candidate: GeometryCandidate,
    scan: ScanConfig,
    heuristics: Sequence[Heuristic],
) -> List[HeuristicEvidence]:
    """
    Run every heuristic against a candidate and return one evidence per heuristic.

    The analyzer calls this entry point. Per-page heuristics (anything exposing `check_page`)
    are batched into a single shared scan loop; all other heuristics are run individually.

    :param data: the full flash image bytes
    :param candidate: the candidate geometry under evaluation
    :param scan: tunables for the evidence-collection pass
    :param heuristics: the ordered list of heuristics to evaluate

    :return: one `HeuristicEvidence` per entry in `heuristics`, in the same order
    """
    per_page: List[PerPageOobHeuristic] = [
        cast(PerPageOobHeuristic, h) for h in heuristics if hasattr(h, "check_page")
    ]
    other: List[GlobalHeuristic] = [
        cast(GlobalHeuristic, h) for h in heuristics if not hasattr(h, "check_page")
    ]

    results: List[HeuristicEvidence] = []
    if per_page:
        results.extend(_scan_per_page_batched(data, candidate, scan, per_page))
    results.extend(h.evaluate(data, candidate, scan) for h in other)

    by_name = {ev.name: ev for ev in results}
    return [by_name[h.spec.name] for h in heuristics]

_scan_per_page_batched(data, candidate, scan, heuristics) private

Shared per-page loop for heuristics that expose check_page.

Source code in ofrak/core/flash/heuristics.py
def _scan_per_page_batched(
    data: bytes,
    candidate: GeometryCandidate,
    scan: ScanConfig,
    heuristics: Sequence[PerPageOobHeuristic],
) -> List[HeuristicEvidence]:
    """
    Shared per-page loop for heuristics that expose `check_page`.
    """
    pages_available = len(data) // candidate.total_chunk_size
    pages_to_scan = min(scan.oob_scan_cap_pages, pages_available)
    hits = [0] * len(heuristics)
    pages_examined = 0
    for page in range(pages_to_scan):
        base = page * candidate.total_chunk_size
        oob_off = base + candidate.data_size
        if oob_off + candidate.oob_size > len(data):
            break
        oob = data[oob_off : oob_off + candidate.oob_size]
        for i, h in enumerate(heuristics):
            hits[i] += h.check_page(oob, candidate.data_size)
        pages_examined += 1
    return [
        HeuristicEvidence(h.spec, hits=n, pages_examined=pages_examined)
        for h, n in zip(heuristics, hits)
    ]

classify_oob(oob)

Classify a page's OOB bytes against the common NAND spare-area conventions.

Parameters:

Name Type Description Default
oob bytes

the out-of-band / spare bytes for a single page

required

Returns:

Type Description
OobPageCategory

the matching OobPageCategory (MIXED if nothing else fits)

Source code in ofrak/core/flash/heuristics.py
def classify_oob(oob: bytes) -> OobPageCategory:
    """
    Classify a page's OOB bytes against the common NAND spare-area conventions.

    :param oob: the out-of-band / spare bytes for a single page

    :return: the matching `OobPageCategory` (`MIXED` if nothing else fits)
    """
    n = len(oob)
    if n == 0:
        return OobPageCategory.EMPTY
    if all(b == 0xFF for b in oob):
        return OobPageCategory.ALL_FF
    if n >= LINUX_MTD_LARGE_PAGE_OOB_SIZE:
        header_erased = oob[:LINUX_MTD_OOB_ECC_OFFSET] == b"\xff" * LINUX_MTD_OOB_ECC_OFFSET
        ecc_region = oob[LINUX_MTD_OOB_ECC_OFFSET:LINUX_MTD_OOB_ECC_END]
        if header_erased and ecc_region != b"\xff" * LINUX_MTD_OOB_ECC_LEN:
            return OobPageCategory.ECC_ONLY
        if oob[0] == 0xFF and oob[1] == YAFFS2_TAG_MARKER_VALUE:
            return OobPageCategory.TAGGED
    return OobPageCategory.MIXED

_linux_mtd_hamming_ecc_256(sector) private

Compute the 3-byte Linux MTD soft-Hamming ECC over one 256-byte sector.

Parameters:

Name Type Description Default
sector bytes

exactly 256 bytes of page data

required

Returns:

Type Description
bytes

the 3-byte ECC code for sector

Exceptions:

Type Description
ValueError

if sector is not exactly 256 bytes

Source code in ofrak/core/flash/heuristics.py
def _linux_mtd_hamming_ecc_256(sector: bytes) -> bytes:
    """
    Compute the 3-byte Linux MTD soft-Hamming ECC over one 256-byte sector.

    :param sector: exactly 256 bytes of page data

    :raises ValueError: if `sector` is not exactly 256 bytes

    :return: the 3-byte ECC code for `sector`
    """
    if len(sector) != LINUX_MTD_HAMMING_SECTOR_SIZE:
        raise ValueError("Hamming step must be exactly 256 bytes")
    w = iter(struct.unpack("<64I", sector))

    def fold_u8(x: int) -> int:
        x ^= x >> 16
        x ^= x >> 8
        return x & 0xFF

    par = 0
    acc = [0, 0, 0, 0]
    rp12 = rp14 = 0
    t = 0
    for q in range(4):
        for assign, ops in _HAM256_ECC_STEPS:
            cur = next(w)
            if assign:
                t = cur
            else:
                t ^= cur
            for rpi, use_t in ops:
                v = t if use_t else cur
                acc[rpi] ^= v
        par ^= t
        if (q & 1) == 0:
            rp12 ^= t
        if (q & 2) == 0:
            rp14 ^= t
    rp4, rp6, rp8, rp10 = acc
    rp4, rp6, rp8, rp10, rp12, rp14 = map(fold_u8, (rp4, rp6, rp8, rp10, rp12, rp14))

    rp3 = ((par >> 16) ^ ((par >> 16) >> 8)) & 0xFF
    rp2 = ((par & 0xFFFF) ^ ((par & 0xFFFF) >> 8)) & 0xFF
    par ^= par >> 16
    rp1 = (par >> 8) & 0xFF
    rp0 = par & 0xFF
    par ^= par >> 8
    par &= 0xFF

    rp5 = (par ^ rp4) & 0xFF
    rp7 = (par ^ rp6) & 0xFF
    rp9 = (par ^ rp8) & 0xFF
    rp11 = (par ^ rp10) & 0xFF
    rp13 = (par ^ rp12) & 0xFF
    rp15 = (par ^ rp14) & 0xFF

    inv = _ECC_INVPARITY
    lo = (rp7, rp6, rp5, rp4, rp3, rp2, rp1, rp0)
    hi = (rp15, rp14, rp13, rp12, rp11, rp10, rp9, rp8)
    code1 = sum(inv[b] << (7 - j) for j, b in enumerate(lo))
    code0 = sum(inv[b] << (7 - j) for j, b in enumerate(hi))
    code2 = (
        sum(inv[par & m] << (7 - k) for k, m in enumerate((0xF0, 0x0F, 0xCC, 0x33, 0xAA, 0x55))) | 3
    )
    return bytes((code0 & 0xFF, code1 & 0xFF, code2 & 0xFF))

_verify_exact_ecc(data, candidate, page_cap) private

Verify Linux MTD soft-Hamming ECC on a sampled subset of pages.

A non-zero match count is a very strong signal that the geometry guess is correct. Only meaningful when the OOB is at least 64 bytes and the data region is a multiple of 256 bytes; otherwise returns (0, 0).

Parameters:

Name Type Description Default
data bytes

the full flash image bytes

required
candidate GeometryCandidate

the candidate geometry to check

required
page_cap int

maximum number of pages to sample

required

Returns:

Type Description
Tuple[int, int]

(pages_checked, exact_matches) where pages_checked counts non-erased pages on which the check ran, and exact_matches counts pages whose 8 ECC triplets all match their recomputed value byte-for-byte

Source code in ofrak/core/flash/heuristics.py
def _verify_exact_ecc(
    data: bytes,
    candidate: GeometryCandidate,
    page_cap: int,
) -> Tuple[int, int]:
    """
    Verify Linux MTD soft-Hamming ECC on a sampled subset of pages.

    A non-zero match count is a very strong signal that the geometry guess is correct.
    Only meaningful when the OOB is at least 64 bytes and the data region is a multiple of
    256 bytes; otherwise returns `(0, 0)`.

    :param data: the full flash image bytes
    :param candidate: the candidate geometry to check
    :param page_cap: maximum number of pages to sample

    :return: `(pages_checked, exact_matches)` where `pages_checked` counts non-erased pages on
        which the check ran, and `exact_matches` counts pages whose 8 ECC triplets all match
        their recomputed value byte-for-byte
    """
    if (
        candidate.oob_size < LINUX_MTD_LARGE_PAGE_OOB_SIZE
        or candidate.data_size < LINUX_MTD_HAMMING_SECTOR_SIZE
        or candidate.data_size % LINUX_MTD_HAMMING_SECTOR_SIZE != 0
    ):
        return 0, 0

    total_chunk_size = candidate.total_chunk_size
    num_pages = len(data) // total_chunk_size
    pages_to_sample = min(page_cap, num_pages)
    if pages_to_sample <= 0:
        return 0, 0

    n_sectors = min(
        candidate.data_size // LINUX_MTD_HAMMING_SECTOR_SIZE,
        LINUX_MTD_HAMMING_MAX_SECTORS,
    )
    all_ff_sector = b"\xff" * LINUX_MTD_HAMMING_SECTOR_SIZE
    erased_prefix = all_ff_sector * n_sectors

    # Stride the sample evenly across the whole image so we catch structured
    # regions even when they live deep in the file (common for chip dumps
    # with blank preambles).
    stride = max(1, num_pages // pages_to_sample)

    pages_checked = 0
    matches = 0
    for k in range(pages_to_sample):
        page_idx = k * stride
        if page_idx >= num_pages:
            break
        base = page_idx * total_chunk_size
        data_start = base
        data_end = base + candidate.data_size
        oob_ecc = data[data_end + LINUX_MTD_OOB_ECC_OFFSET : data_end + LINUX_MTD_OOB_ECC_END]
        if len(oob_ecc) != LINUX_MTD_OOB_ECC_LEN:
            break
        page_data = data[data_start:data_end]
        if page_data[: n_sectors * LINUX_MTD_HAMMING_SECTOR_SIZE] == erased_prefix:
            continue
        pages_checked += 1
        all_exact = True
        for s in range(n_sectors):
            sector = page_data[
                s * LINUX_MTD_HAMMING_SECTOR_SIZE : (s + 1) * LINUX_MTD_HAMMING_SECTOR_SIZE
            ]
            expected = _linux_mtd_hamming_ecc_256(sector)
            actual = oob_ecc[s * 3 : s * 3 + 3]
            if actual != expected:
                all_exact = False
                break
        if all_exact:
            matches += 1
    return pages_checked, matches