Understanding the Internals of Lance’s describe_indices() Method

The article walks through Lance’s describe_indices() workflow—from reading the manifest and caching index metadata, through optional filtering and grouping by logical index name, to building human‑readable index descriptions and highlighting differences from load_indices and index_statistics, while noting edge cases and limitations.

Big Data Technology Tribe

May 27, 2026

Problem solved by describe_indices()

In Lance, each logical index name can map to one or more physical segments, each identified by a UUID. The describe_indices() API reads only the manifest's IndexMetadata (plus a small amount of dataset information) and produces a human-readable description of each logical index. It normally does this without opening the ANN index files; the exception, covered in Stage 1, is a legacy index whose details have to be inferred.

Overall flow (Python → Rust)

ds.describe_indices()
    │
    ▼
Python: LanceDataset._ds.describe_indices()
    │
    ▼
Rust: Dataset::describe_indices(None)
    │
    ├─① load_indices()      ← read manifest, cache, supplement details
    ├─② optional criteria filtering
    ├─③ group by index.name
    ├─④ IndexDescriptionImpl::try_new() for each group
    └─⑤ wrap as PyIndexDescription

Stage 1 – load_indices() : reading the manifest

The first step always calls load_indices(), which reads the IndexSection from the manifest and converts each entry into an IndexMetadata struct containing fields such as uuid, name, fields, dataset_version, fragment_bitmap, index_details, etc.

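pub async fn read_manifest_indexes(...) -> Result<Vec<IndexMetadata>> {
    // The manifest records where the index section is stored, if there is one.
    if let Some(pos) = manifest.index_section.as_ref() {
        let section: pb::IndexSection = read_message(reader.as_ref(), *pos).await?;
        // Convert each protobuf entry; the first failed conversion aborts the call.
        let indices = section.indices
            .into_iter()
            .map(IndexMetadata::try_from)
            .collect::<Result<Vec<_>>>()?;
        Ok(indices)
    } else {
        // No index section: the dataset has no indices.
        Ok(vec![])
    }
}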
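The collect::<Result<Vec<_>>>() step is what makes the conversion all-or-nothing: one bad entry fails the whole read. The following is a minimal, std-only sketch of that pattern. RawEntry, Entry, and convert_all are hypothetical stand-ins invented for illustration; they are not Lance's protobuf or metadata types.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for a raw manifest entry (not Lance's protobuf type).
pub struct RawEntry {
    pub name: String,
    pub version: i64,
}

/// Hypothetical stand-in for the converted in-memory metadata.
#[derive(Debug, PartialEq)]
pub struct Entry {
    pub name: String,
    pub version: u32,
}

impl TryFrom<RawEntry> for Entry {
    type Error = String;

    fn try_from(raw: RawEntry) -> Result<Self, Self::Error> {
        // Reject values that do not fit the target type.
        let version = u32::try_from(raw.version)
            .map_err(|_| format!("invalid version for index {}", raw.name))?;
        Ok(Entry { name: raw.name, version })
    }
}

/// Convert every entry; the first failure aborts the whole collection.
pub fn convert_all(raw: Vec<RawEntry>) -> Result<Vec<Entry>, String> {
    raw.into_iter()
        .map(Entry::try_from)
        .collect::<Result<Vec<_>, _>>()
}
```

Passing two entries where the second has a negative version yields an Err, even though the first entry converted successfully.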

Caching

Results are cached per dataset version, so repeated calls with the same version avoid re‑reading the manifest.

let metadata_key = IndexMetadataKey { version: self.version().version };
let mut indices = match self.index_cache.get_with_key(&metadata_key).await {
    Some(indices) => indices,
    None => { /* read_manifest_indexes + retain_supported_indices + cache */ }
};
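The idea of keying the cache by dataset version can be sketched with a plain HashMap. This is an illustration only: VersionCache and get_or_load are hypothetical names, the cached value is simplified to a list of index names, and Lance's real index_cache is asynchronous and shared.

```rust
use std::collections::HashMap;

/// Hypothetical cache keyed by dataset version (not Lance's `index_cache`).
pub struct VersionCache {
    entries: HashMap<u64, Vec<String>>,
    /// Number of times the slow path (the loader) actually ran.
    pub loads: usize,
}

impl VersionCache {
    pub fn new() -> Self {
        VersionCache { entries: HashMap::new(), loads: 0 }
    }

    /// Return the cached list for `version`, running `load` only on a miss.
    pub fn get_or_load<F>(&mut self, version: u64, load: F) -> &Vec<String>
    where
        F: FnOnce() -> Vec<String>,
    {
        if !self.entries.contains_key(&version) {
            self.loads += 1;
            self.entries.insert(version, load());
        }
        &self.entries[&version]
    }
}
```

Two calls with the same version run the loader once; a call with a new version runs it again, which mirrors why a new commit to the dataset leads to a fresh manifest read.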

Filtering unsupported index versions

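pub(crate) fn retain_supported_indices(indices: &mut Vec<IndexMetadata>) {
    // Drop entries whose index format version is newer than this build supports
    // (max_supported_version is defined outside this excerpt).
    indices.retain(|idx| idx.index_version <= max_supported_version as i32);
}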

Inferring missing vector index details

Legacy indices may lack index_details. The system infers them by name, possibly reading the actual index files.

pub async fn infer_missing_vector_details(dataset: &Dataset, indices: &mut [IndexMetadata]) {
    // `inferred` maps an index name to its inferred details;
    // how it is built is omitted from this excerpt.
    for index in indices.iter_mut() {
        if let Some(details) = inferred.get(&index.name) {
            index.index_details = Some(details.clone());
        }
    }
}
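The fill-in step can be modelled on its own with simplified types. In this sketch, Segment and fill_missing_details are hypothetical, details are reduced to a string, and the guard that leaves existing details untouched is an assumption added for safety rather than something the excerpt above shows.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified segment metadata.
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub name: String,
    pub details: Option<String>,
}

/// Fill in `details` for segments that lack them, using a per-name lookup.
/// Returns how many segments were updated.
pub fn fill_missing_details(
    segments: &mut [Segment],
    inferred: &HashMap<String, String>,
) -> usize {
    let mut updated = 0;
    for seg in segments.iter_mut() {
        if seg.details.is_some() {
            // Assumption: details already recorded in the manifest are kept.
            continue;
        }
        if let Some(details) = inferred.get(&seg.name) {
            seg.details = Some(details.clone());
            updated += 1;
        }
    }
    updated
}
```

Because the lookup is by name, every detail-less segment of the same logical index receives the same inferred details, and a name with no inferred entry stays without details.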

Fragment‑reuse bitmap remap

If a system index for fragment reuse exists, the fragment_bitmap of every segment is remapped to maintain internal consistency.

Stage 2 – Optional criteria filtering

The Python API passes None, but the Rust implementation can filter indices using an IndexCriteria object.

let mut indices = if let Some(criteria) = criteria {
    indices.iter().filter(|idx| {
        if idx.index_details.is_none() {
            log::warn!("... does not support indexes without index details ...");
            return false;
        }
        index_matches_criteria(idx, &criteria, &fields, false, self.schema())
    }).collect()
} else {
    indices.iter().collect()
};

Segments without index_details are skipped with a warning.
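The two branches (no criteria versus criteria) can be sketched with simplified types. Segment, Criteria, its has_name field, and select are hypothetical here; the real index_matches_criteria also considers fields and the schema, which this sketch leaves out.

```rust
/// Hypothetical, simplified segment metadata.
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub name: String,
    pub details: Option<String>,
}

/// Hypothetical criteria: match on an exact logical index name.
pub struct Criteria {
    pub has_name: String,
}

/// With no criteria, every segment passes. With criteria, segments that
/// lack details are dropped before the match is evaluated.
pub fn select<'a>(segments: &'a [Segment], criteria: Option<&Criteria>) -> Vec<&'a Segment> {
    match criteria {
        None => segments.iter().collect(),
        Some(c) => segments
            .iter()
            .filter(|seg| {
                if seg.details.is_none() {
                    eprintln!("skipping {}: no index details", seg.name);
                    return false;
                }
                seg.name == c.has_name
            })
            .collect(),
    }
}
```

Note the asymmetry: a segment without details survives when no criteria are given, so it is still present when Stage 4 runs its own checks.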

Stage 3 – Grouping by logical index name

indices.sort_by_key(|idx| &idx.name);
let grouped: Vec<Vec<IndexMetadata>> = indices
    .into_iter()
    .chunk_by(|idx| idx.name.clone())
    .into_iter()
    .map(|(_, segments)| segments.cloned().collect())
    .collect();

All segments sharing the same logical name are aggregated into one group, turning many physical segments into a single logical index description.
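The sort matters because chunk_by (from the itertools crate) only merges consecutive items with equal keys. A std-only sketch of the same sort-then-split idea follows; Segment and group_by_name are hypothetical names, and the UUID is reduced to an integer for brevity.

```rust
/// Hypothetical, simplified segment metadata.
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub name: String,
    /// Stand-in for a real UUID.
    pub uuid: u32,
}

/// Sort by name, then split into runs of equal names.
pub fn group_by_name(mut segments: Vec<Segment>) -> Vec<Vec<Segment>> {
    // The sort is stable, so segments keep their original order within a name.
    segments.sort_by(|a, b| a.name.cmp(&b.name));

    let mut groups: Vec<Vec<Segment>> = Vec::new();
    for seg in segments {
        // A segment joins the last group only if that group has the same name.
        let same_run = groups
            .last()
            .map_or(false, |group| group[0].name == seg.name);
        if same_run {
            groups.last_mut().unwrap().push(seg);
        } else {
            groups.push(vec![seg]);
        }
    }
    groups
}
```

For input names b, a, b the result has two groups, a with one segment and b with two. Without the sort, the same input would produce three groups.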

Stage 4 – IndexDescriptionImpl::try_new() – aggregation and field computation

Consistency checks

Segments must be non‑empty; otherwise there is no metadata.

All segments in a group must share the same name; otherwise the logical index name is inconsistent.

All segments must have identical fields; otherwise the same logical name indexes different columns.

All segments must contain index_details; missing details indicate a legacy index that needs retraining.

All segments must have the same type_url; mixing different index types under one logical name is an error.
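The five checks above can be read as one validation pass that reports the first violation. The sketch below assumes a simplified Segment type in which a missing type_url stands for missing index_details; validate and the error strings are hypothetical and do not reproduce Lance's error messages.

```rust
/// Hypothetical, simplified segment metadata.
#[derive(Debug, Clone)]
pub struct Segment {
    pub name: String,
    pub fields: Vec<i32>,
    /// `None` models a segment without index_details.
    pub type_url: Option<String>,
}

/// Apply the checks in order and report the first violation.
pub fn validate(segments: &[Segment]) -> Result<(), String> {
    let first = match segments.first() {
        Some(seg) => seg,
        None => return Err("no segments to describe".to_string()),
    };
    let first_url = match &first.type_url {
        Some(url) => url,
        None => return Err(format!("segment of {} has no index details", first.name)),
    };
    for seg in &segments[1..] {
        if seg.name != first.name {
            return Err("segments disagree on the index name".to_string());
        }
        if seg.fields != first.fields {
            return Err("segments disagree on the indexed fields".to_string());
        }
        match &seg.type_url {
            None => return Err(format!("segment of {} has no index details", seg.name)),
            Some(url) if url != first_url => {
                return Err("segments disagree on the index type".to_string());
            }
            Some(_) => {}
        }
    }
    Ok(())
}
```

Comparing every segment against the first one is enough, because equality with a common reference implies that all segments agree with each other.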

Deriving index_type (short name)

let index_type = if details.is_vector() {
    derive_vector_index_type(index_details) // e.g. "IVF_PQ", "IVF_HNSW_SQ"
} else if let Some(system_type) = infer_system_index_type(...) {
    system_type.to_string() // frag‑reuse, mem‑wal, etc.
} else {
    details.get_plugin().map(|p| p.name()).unwrap_or("Unknown")
};
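The branch above is an ordered fallback: vector type first, then system index type, then the plugin name, and finally "Unknown". A compact sketch of that ordering with Option chaining follows. The Details struct and its three fields are hypothetical simplifications; in Lance the three sources are computed by separate functions rather than stored side by side.

```rust
/// Hypothetical, simplified view of where a short type name can come from.
pub struct Details {
    /// Set when the index is a vector index, e.g. "IVF_PQ".
    pub vector_type: Option<String>,
    /// Set when the index is a system index.
    pub system_type: Option<String>,
    /// Name reported by a scalar index plugin, if one is registered.
    pub plugin_name: Option<String>,
}

/// Pick the first available name, falling back to "Unknown".
pub fn short_type(details: &Details) -> String {
    details
        .vector_type
        .clone()
        .or_else(|| details.system_type.clone())
        .or_else(|| details.plugin_name.clone())
        .unwrap_or_else(|| "Unknown".to_string())
}
```

The or_else closures run lazily, so later sources are consulted only when the earlier ones are absent.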

Computing num_rows_indexed

let mut rows_indexed: u64 = 0;
for shard in &segments {
    let fragment_bitmap = shard.fragment_bitmap.as_ref()?;
    for fragment in dataset.get_fragments() {
        if fragment_bitmap.contains(fragment.id() as u32) {
            rows_indexed += fragment.fast_logical_rows()? as u64;
        }
    }
}

The count is the sum of logical rows of fragments covered by the bitmap, not the number of vectors stored in the index file. Overlapping fragments can cause double‑counting.
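A small model makes the double-counting concrete. In this sketch, bitmaps are plain BTreeSet<u32> values and fragment sizes live in a HashMap; both function names are hypothetical. The deduplicated variant is shown only for contrast and is not something the article attributes to Lance.

```rust
use std::collections::{BTreeSet, HashMap};

/// Sum coverage per segment, as in the loop above: every segment contributes
/// the rows of each fragment that its bitmap contains.
pub fn rows_indexed(
    segment_bitmaps: &[BTreeSet<u32>],
    fragment_rows: &HashMap<u32, u64>,
) -> u64 {
    let mut total = 0u64;
    for bitmap in segment_bitmaps {
        for (fragment_id, rows) in fragment_rows {
            if bitmap.contains(fragment_id) {
                total += rows;
            }
        }
    }
    total
}

/// Contrast case: count each covered fragment once (union of all bitmaps).
pub fn rows_indexed_deduplicated(
    segment_bitmaps: &[BTreeSet<u32>],
    fragment_rows: &HashMap<u32, u64>,
) -> u64 {
    let covered: BTreeSet<u32> = segment_bitmaps.iter().flatten().copied().collect();
    covered.iter().filter_map(|id| fragment_rows.get(id)).sum()
}
```

With fragments 0, 1, and 2 holding 100, 50, and 25 rows and two segments covering {0, 1} and {1, 2}, the first function returns 225 while the second returns 175; the 50 rows of fragment 1 are counted twice in the per-segment sum.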

Computing total_size_bytes

fn total_size_bytes(&self) -> Option<u64> {
    let mut total = 0u64;
    for segment in &self.segments {
        let files = segment.files.as_ref()?;
        for file in files { total += file.size_bytes; }
    }
    Some(total)
}

If any segment lacks a files list, the overall size is None.

Details JSON

Vector indexes: vector_details_as_json provides metric, PQ/SQ parameters, HNSW config, etc.

Scalar indexes: plugin‑specific details_as_json.

Stage 5 – Python wrapper

impl PyIndexDescription {
    pub fn new(index: &dyn IndexDescription, dataset: &LanceDataset) -> Self {
        // field_ids → field_names (lookup schema)
        // segments → Vec<PyIndexSegmentDescription>
        Self { name, fields, field_names, index_type, type_url,
               num_rows_indexed: index.rows_indexed(),
               details: PyJson(...),
               segments,
               total_size_bytes: index.total_size_bytes(), }
    }
}

impl PyIndexSegmentDescription {
    pub fn from_metadata(segment: &IndexMetadata) -> Self {
        // uuid, dataset_version, fragment_ids, index_version, created_at, size_bytes
    }
}
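The field_ids to field_names step noted in the comments is a schema lookup. A minimal sketch under stated assumptions: the schema is reduced to a HashMap from field id to column name, field_names is a hypothetical helper, and returning an error for an unknown id is a choice made for this sketch, since the article does not say how the wrapper handles that case.

```rust
use std::collections::HashMap;

/// Resolve field ids to column names. Ids missing from the schema are
/// reported as an error rather than silently dropped (an assumption).
pub fn field_names(
    field_ids: &[i32],
    schema: &HashMap<i32, String>,
) -> Result<Vec<String>, String> {
    field_ids
        .iter()
        .map(|id| {
            schema
                .get(id)
                .cloned()
                .ok_or_else(|| format!("unknown field id {}", id))
        })
        .collect()
}
```

The output preserves the order of field_ids, so the names line up position by position with the ids in the description.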

Comparison with load_indices and index_statistics

Granularity: load_indices returns one entry per segment; describe_indices returns one entry per logical index; index_statistics works per index name.

Opening the index: load_indices may open the index (slow); describe_indices deliberately avoids opening; index_statistics opens the index to compute stats.

Row count: describe_indices provides num_rows_indexed (bitmap estimate); index_statistics provides both indexed and unindexed row counts.

Purpose: load_indices is for low-level metadata; describe_indices lists indices, segment UUIDs, covered fragments, size, and type; index_statistics supports operational/optimization decisions.

Edge cases and pitfalls

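Legacy indices without index_details trigger infer_missing_vector_details. If inference fails, the segment still has no details: under criteria filtering it is skipped with a warning (Stage 2), and otherwise it fails the index_details consistency check in Stage 4.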

A missing fragment_bitmap causes try_new to error; the index must be retrained before it can be described.

num_rows_indexed reflects the manifest bitmap combined with the current logical row counts of the covered fragments; it does not equal the exact vector count in the index file.

Overlapping segment bitmaps can lead to double‑counted rows.

The method does not read index contents, so IVF parameters and similar settings are available only through the details JSON or via index_statistics.

System indexes such as frag-reuse or mem-wal appear in the list unless filtered out by criteria.

describe_indices() is suited for discovering which indices exist, how many segments they span, which fragments they cover, their total size, and their type, while index_statistics is more appropriate for optimization decisions involving unindexed rows.
