Understanding the Internals of Lance’s describe_indices() Method

The article walks through Lance’s describe_indices() workflow—from reading the manifest and caching index metadata, through optional filtering and grouping by logical index name, to building human‑readable index descriptions and highlighting differences from load_indices and index_statistics, while noting edge cases and limitations.

Big Data Technology Tribe

Problem solved by describe_indices()

In Lance, a logical index name can map to several physical segments, each identified by a UUID. The describe_indices() API reads only the IndexMetadata recorded in the manifest (plus a small amount of dataset information) and produces a human-readable description of each logical index without opening the ANN index files.

Overall flow (Python → Rust)

ds.describe_indices()
    │
    ▼
Python: LanceDataset._ds.describe_indices()
    │
    ▼
Rust: Dataset::describe_indices(None)
    │
    ├─① load_indices()      ← read manifest, cache, supplement details
    ├─② optional criteria filtering
    ├─③ group by index.name
    ├─④ IndexDescriptionImpl::try_new() for each group
    └─⑤ wrap as PyIndexDescription

Stage 1 – load_indices() : reading the manifest

The first step always calls load_indices(), which reads the IndexSection from the manifest and converts each entry into an IndexMetadata struct containing fields such as uuid, name, fields, dataset_version, fragment_bitmap, index_details, etc.

pub async fn read_manifest_indexes(...) -> Result<Vec<IndexMetadata>> {
    if let Some(pos) = manifest.index_section.as_ref() {
        let section: pb::IndexSection = read_message(reader.as_ref(), *pos).await?;
        let indices = section.indices
            .into_iter()
            .map(IndexMetadata::try_from)
            .collect::<Result<Vec<_>>>()?;
        Ok(indices)
    } else {
        Ok(vec![])
    }
}

Caching

Results are cached per dataset version, so repeated calls with the same version avoid re‑reading the manifest.

let metadata_key = IndexMetadataKey { version: self.version().version };
let mut indices = match self.index_cache.get_with_key(&metadata_key).await {
    Some(indices) => indices,
    None => { /* read_manifest_indexes + retain_supported_indices + cache */ }
};
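
The per-version caching above can be illustrated with a plain HashMap keyed by dataset version. This is a minimal synchronous sketch, not Lance's actual cache: `VersionedIndexCache`, `IndexMeta`, `manifest_reads`, and `read_manifest_stub` are hypothetical names introduced only for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for IndexMetadata (only two fields kept).
#[derive(Clone, Debug, PartialEq)]
struct IndexMeta {
    name: String,
    uuid: String,
}

/// Hypothetical cache keyed by dataset version.
#[derive(Default)]
struct VersionedIndexCache {
    entries: HashMap<u64, Vec<IndexMeta>>,
    /// Counts how often the expensive loader actually ran.
    manifest_reads: usize,
}

impl VersionedIndexCache {
    /// Returns cached metadata for `version`; calls `load` only on a miss.
    fn get_or_load<F>(&mut self, version: u64, load: F) -> Vec<IndexMeta>
    where
        F: FnOnce() -> Vec<IndexMeta>,
    {
        if let Some(hit) = self.entries.get(&version) {
            return hit.clone();
        }
        let loaded = load();
        self.manifest_reads += 1;
        self.entries.insert(version, loaded.clone());
        loaded
    }
}

/// Pretend manifest read used by the example (assumption, not a Lance call).
fn read_manifest_stub() -> Vec<IndexMeta> {
    vec![IndexMeta {
        name: "vec_idx".to_string(),
        uuid: "seg-a".to_string(),
    }]
}
```

Calling `get_or_load(7, read_manifest_stub)` twice runs the loader once; asking for version 8 runs it again, which mirrors the "cached per dataset version" behaviour described above.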

Filtering unsupported index versions

pub(crate) fn retain_supported_indices(indices: &mut Vec<IndexMetadata>) {
    indices.retain(|idx| idx.index_version <= max_supported_version as i32);
}
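
A self-contained variant of the same idea is sketched below. The excerpt does not show where `max_supported_version` comes from, so this sketch simply takes it as a parameter; `IndexMeta` and `retain_supported` are hypothetical names.

```rust
/// Hypothetical stand-in for IndexMetadata.
#[derive(Debug, Clone, PartialEq)]
struct IndexMeta {
    name: String,
    index_version: i32,
}

/// Drops entries whose format version is newer than the reader supports.
/// Assumption: the limit is supplied by the caller; the source of this value
/// in Lance is not shown in the excerpt.
fn retain_supported(indices: &mut Vec<IndexMeta>, max_supported_version: u32) {
    indices.retain(|idx| idx.index_version <= max_supported_version as i32);
}
```

The effect is that an index written in a newer format is hidden from the result rather than reported as an error.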

Inferring missing vector index details

Legacy indices may lack index_details. The system infers them by name, possibly reading the actual index files.

pub async fn infer_missing_vector_details(dataset: &Dataset, indices: &mut [IndexMetadata]) {
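    // `inferred` maps a logical index name to its inferred details; how it is built is elided in this excerpt.
    for index in indices.iter_mut() {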
        if let Some(details) = inferred.get(&index.name) {
            index.index_details = Some(details.clone());
        }
    }
}
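
The fill-in step can be sketched in isolation as follows. The types are stand-ins (`index_details` is a plain string here, not the real payload), and the guard that leaves existing details untouched is an assumption suggested by the word "missing", not something the excerpt confirms.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for IndexMetadata.
#[derive(Debug, Clone, PartialEq)]
struct IndexMeta {
    name: String,
    /// Stand-in for the real details payload; None models a legacy index.
    index_details: Option<String>,
}

/// Fills `index_details` where it is missing, keyed by logical index name.
fn fill_missing_details(indices: &mut [IndexMeta], inferred: &HashMap<String, String>) {
    for index in indices.iter_mut() {
        if index.index_details.is_some() {
            // Assumption: details already present in the manifest are kept.
            continue;
        }
        if let Some(details) = inferred.get(&index.name) {
            index.index_details = Some(details.clone());
        }
    }
}
```

Entries with no match in `inferred` stay at `None`, which is the state the later stages have to handle.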

Fragment‑reuse bitmap remap

If a system index for fragment reuse exists, the fragment_bitmap of every segment is remapped to maintain internal consistency.

Stage 2 – Optional criteria filtering

The Python API passes None, but the Rust implementation can filter indices using an IndexCriteria object.

let mut indices = if let Some(criteria) = criteria {
    indices.iter().filter(|idx| {
        if idx.index_details.is_none() {
            log::warn!("... does not support indexes without index details ...");
            return false;
        }
        index_matches_criteria(idx, &criteria, &fields, false, self.schema())
    }).collect()
} else {
    indices.iter().collect()
};

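When criteria are supplied, segments without index_details are skipped with a warning. When no criteria are given, every segment passes through to the next stage.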
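
The two branches can be sketched with simplified types. `Criteria` here is a hypothetical, much-reduced stand-in that matches on a single field id; the real IndexCriteria and index_matches_criteria are richer than this.

```rust
/// Hypothetical stand-in for IndexMetadata.
#[derive(Debug, Clone, PartialEq)]
struct IndexMeta {
    name: String,
    fields: Vec<i32>,
    index_details: Option<String>,
}

/// Hypothetical, reduced criteria: keep indices that cover `for_field`.
struct Criteria {
    for_field: i32,
}

fn filter_indices<'a>(
    indices: &'a [IndexMeta],
    criteria: Option<&Criteria>,
) -> Vec<&'a IndexMeta> {
    match criteria {
        // No criteria: every segment passes through untouched.
        None => indices.iter().collect(),
        Some(c) => indices
            .iter()
            .filter(|idx| {
                if idx.index_details.is_none() {
                    // Mirrors the warning path: cannot be matched, so skip it.
                    eprintln!("skipping index {:?}: no index details", idx.name);
                    return false;
                }
                idx.fields.contains(&c.for_field)
            })
            .collect(),
    }
}
```

Note that a legacy segment without details survives the `None` branch, so it still reaches the consistency checks of the later stage.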

Stage 3 – Grouping by logical index name

indices.sort_by_key(|idx| &idx.name);
let grouped: Vec<Vec<IndexMetadata>> = indices
    .into_iter()
    .chunk_by(|idx| idx.name.clone())
    .into_iter()
    .map(|(_, segments)| segments.cloned().collect())
    .collect();

All segments sharing the same logical name are aggregated into one group, turning many physical segments into a single logical index description.
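
For readers without itertools at hand, the same grouping can be sketched with the standard library only. This version uses a BTreeMap instead of sort plus chunk_by, which is a swapped-in technique with the same visible result (groups ordered by name); `IndexMeta` and `group_by_name` are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for IndexMetadata.
#[derive(Debug, Clone, PartialEq)]
struct IndexMeta {
    name: String,
    uuid: String,
}

/// Groups segments by logical index name; groups come out sorted by name,
/// and segments keep their input order inside each group.
fn group_by_name(segments: &[IndexMeta]) -> Vec<Vec<IndexMeta>> {
    let mut groups: BTreeMap<&str, Vec<IndexMeta>> = BTreeMap::new();
    for seg in segments {
        groups.entry(seg.name.as_str()).or_default().push(seg.clone());
    }
    groups.into_values().collect()
}
```

Three segments named vec_idx, id_idx, vec_idx therefore produce two groups: one with a single id_idx segment and one with both vec_idx segments.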

Stage 4 – IndexDescriptionImpl::try_new() – aggregation and field computation

Consistency checks

The group must contain at least one segment; an empty group has no metadata to describe.

All segments in a group must share the same name; otherwise the logical index name is inconsistent.

All segments must have identical fields; otherwise the same logical name indexes different columns.

All segments must contain index_details; missing details indicate a legacy index that needs retraining.

All segments must have the same type_url; mixing different index types under one logical name is an error.
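
The five checks above can be written as one validation pass. This is a sketch under simplified assumptions: `IndexMeta` is a stand-in, `type_url: None` models a segment without index_details, and errors are plain strings rather than Lance's error type.

```rust
/// Hypothetical stand-in for IndexMetadata.
#[derive(Debug, Clone)]
struct IndexMeta {
    name: String,
    fields: Vec<i32>,
    /// None models a segment whose index_details are missing.
    type_url: Option<String>,
}

/// Returns Ok(()) only if the group can be described as one logical index.
fn validate_group(segments: &[IndexMeta]) -> Result<(), String> {
    // Check 1: the group must not be empty.
    let first = segments.first().ok_or("group is empty")?;
    // Check 4 (for the first segment): details must be present.
    let first_type = first
        .type_url
        .as_ref()
        .ok_or("segment has no index details")?;
    for seg in segments {
        // Check 2: one logical name per group.
        if seg.name != first.name {
            return Err(format!("mixed names: {} vs {}", first.name, seg.name));
        }
        // Check 3: identical indexed fields.
        if seg.fields != first.fields {
            return Err(format!("index {} covers different fields", first.name));
        }
        match seg.type_url.as_ref() {
            // Check 4: every segment needs details.
            None => return Err(format!("index {} lacks index details", first.name)),
            // Check 5: a single index type per logical name.
            Some(t) if t != first_type => {
                return Err(format!("index {} mixes {} and {}", first.name, first_type, t));
            }
            Some(_) => {}
        }
    }
    Ok(())
}
```

Failing fast on the first violation keeps the description step from silently merging segments that do not belong together.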

Deriving index_type (short name)

let index_type = if details.is_vector() {
    derive_vector_index_type(index_details) // e.g. "IVF_PQ", "IVF_HNSW_SQ"
} else if let Some(system_type) = infer_system_index_type(...) {
    system_type.to_string() // frag‑reuse, mem‑wal, etc.
} else {
    details.get_plugin().map(|p| p.name()).unwrap_or("Unknown")
};
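
The precedence in that excerpt (vector type first, then system type, then plugin name, then "Unknown") can be shown with a flattened stand-in. `TypeInputs` and `short_index_type` are hypothetical; the real code works on the details payload rather than on pre-computed strings.

```rust
/// Hypothetical, flattened view of the inputs to the type decision.
struct TypeInputs<'a> {
    /// Some(..) when the details describe a vector index, e.g. "IVF_PQ".
    vector_type: Option<&'a str>,
    /// Some(..) for system indexes such as frag-reuse or mem-wal.
    system_type: Option<&'a str>,
    /// Some(..) when a plugin recognises the details.
    plugin_name: Option<&'a str>,
}

/// Picks the short type name using the same order of precedence.
fn short_index_type(inputs: &TypeInputs<'_>) -> String {
    inputs
        .vector_type
        .or(inputs.system_type)
        .or(inputs.plugin_name)
        .unwrap_or("Unknown")
        .to_string()
}
```

With all three inputs absent the function falls back to "Unknown", matching the final branch of the excerpt.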

Computing num_rows_indexed

let mut rows_indexed: u64 = 0;
for shard in &segments {
    let fragment_bitmap = shard.fragment_bitmap.as_ref()?;
    for fragment in dataset.get_fragments() {
        if fragment_bitmap.contains(fragment.id() as u32) {
            rows_indexed += fragment.fast_logical_rows()? as u64;
        }
    }
}

The count is the sum of the logical row counts of the fragments covered by each segment's bitmap, not the number of vectors stored in the index file. Because the sum runs per segment, a fragment that appears in the bitmaps of two segments is counted twice.
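
A small worked example makes the double-counting concrete. The sketch below models bitmaps as sets of fragment ids and fragments as an id-to-row-count map; both are stand-ins. `rows_covered_once` is not something the source describes, it is included only as a point of comparison.

```rust
use std::collections::{BTreeSet, HashMap};

/// Per-segment sum, following the loop structure of the excerpt above.
fn rows_indexed(segment_bitmaps: &[BTreeSet<u32>], fragment_rows: &HashMap<u32, u64>) -> u64 {
    let mut total = 0u64;
    for bitmap in segment_bitmaps {
        for (fragment_id, rows) in fragment_rows {
            if bitmap.contains(fragment_id) {
                total += *rows;
            }
        }
    }
    total
}

/// Comparison only (assumption, not Lance behaviour): count each covered
/// fragment once, however many segments list it.
fn rows_covered_once(segment_bitmaps: &[BTreeSet<u32>], fragment_rows: &HashMap<u32, u64>) -> u64 {
    let covered: BTreeSet<u32> = segment_bitmaps.iter().flatten().copied().collect();
    covered.iter().filter_map(|id| fragment_rows.get(id)).sum()
}
```

Take fragments 0, 1, 2 with 100, 200, 300 rows, and two segments covering {0, 1} and {1, 2}. The per-segment sum is 300 + 500 = 800, while counting each fragment once gives 600; the difference is fragment 1, which both segments cover.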

Computing total_size_bytes

fn total_size_bytes(&self) -> Option<u64> {
    let mut total = 0u64;
    for segment in &self.segments {
        let files = segment.files.as_ref()?;
        for file in files { total += file.size_bytes; }
    }
    Some(total)
}

If any segment lacks a files list, the overall size is None.
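
The all-or-nothing behaviour comes from using `?` on an Option inside the loop. A self-contained sketch, with `SegmentFiles` as a hypothetical stand-in that stores only file sizes:

```rust
/// Hypothetical stand-in for a segment; each entry is a file size in bytes.
/// None means the manifest recorded no file list for this segment.
struct SegmentFiles {
    files: Option<Vec<u64>>,
}

/// Returns the summed size, or None as soon as one segment has no file list.
fn total_size_bytes(segments: &[SegmentFiles]) -> Option<u64> {
    let mut total = 0u64;
    for segment in segments {
        // `?` returns None from the whole function on the first missing list.
        let files = segment.files.as_ref()?;
        total += files.iter().sum::<u64>();
    }
    Some(total)
}
```

Returning None instead of a partial sum avoids reporting a size that looks exact but silently omits some segments.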

Details JSON

Vector indexes: vector_details_as_json provides metric, PQ/SQ parameters, HNSW config, etc.

Scalar indexes: plugin‑specific details_as_json.

Stage 5 – Python wrapper

impl PyIndexDescription {
    pub fn new(index: &dyn IndexDescription, dataset: &LanceDataset) -> Self {
        // field_ids → field_names (lookup schema)
        // segments → Vec<PyIndexSegmentDescription>
        Self { name, fields, field_names, index_type, type_url,
               num_rows_indexed: index.rows_indexed(),
               details: PyJson(...),
               segments,
               total_size_bytes: index.total_size_bytes(), }
    }
}

impl PyIndexSegmentDescription {
    pub fn from_metadata(segment: &IndexMetadata) -> Self {
        // uuid, dataset_version, fragment_ids, index_version, created_at, size_bytes
    }
}

Comparison with load_indices and index_statistics

Granularity: load_indices returns one entry per segment; describe_indices returns one entry per logical index; index_statistics works per index name.

Opening the index: load_indices may open the index (slow); describe_indices deliberately avoids opening it; index_statistics opens the index to compute stats.

Row count: describe_indices provides num_rows_indexed (a bitmap-based estimate); index_statistics provides both indexed and unindexed row counts.

Purpose: load_indices is for low-level metadata; describe_indices lists indices, segment UUIDs, covered fragments, size, and type; index_statistics supports operational/optimization decisions.

Edge cases and pitfalls

Legacy indices without index_details trigger infer_missing_vector_details; if inference fails, the index is skipped with a warning.

Missing fragment_bitmap causes try_new to error, requiring retraining before the index can be described.

num_rows_indexed reflects the manifest bitmap plus the current logical row counts of the covered fragments; it does not equal the exact vector count in the index file.

Overlapping segment bitmaps can lead to double‑counted rows.

The method does not read index contents, so IVF parameters, recall, etc., are only available in the details JSON or via index_stats.

System indexes such as frag-reuse or mem-wal appear in the list unless filtered out by criteria.

In short, describe_indices() is suited for discovering which indices exist, how many segments they span, which fragments they cover, their total size, and their type, while index_statistics is more appropriate for optimization decisions involving unindexed rows.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
