Add client-side batching for CreateMultiple/UpdateMultiple/UpsertMultiple #156

@suyask-msft

Description

Problem

CreateMultiple, UpdateMultiple, and UpsertMultiple each send all records in a single POST regardless of count. There is no client-side chunking.

  • _create_multiple (data/_odata.py:316-376) — builds one {"Targets": [...]} payload with every record and POSTs it.
  • _update_multiple (data/_odata.py:656-697) — same pattern.
  • _upsert_multiple (data/_odata.py:440-493) — same pattern.

Dataverse has a server-side limit (typically 1,000 records per *Multiple call). Sending more can result in 400/413 errors or timeouts. Today, callers must chunk manually in their scripts. The SDK should handle this internally.

Proposed changes

1. Client-side batching (correctness fix)

Split large record lists into 1,000-record chunks and send each as a separate POST. This is the minimum viable fix.

# Pseudocode for _create_multiple with batching
BATCH_SIZE = 1000

def _create_multiple(self, entity_set, table_schema_name, records):
    all_ids = []
    for i in range(0, len(records), BATCH_SIZE):
        chunk = records[i:i + BATCH_SIZE]
        ids = self._create_multiple_batch(entity_set, table_schema_name, chunk)
        all_ids.extend(ids)
    return all_ids

Atomicity trade-off: Today a single POST is atomic (all-or-nothing). Splitting into batches means partial success is possible — batch 1 succeeds, batch 2 fails, leaving the caller with a partial import. This should be clearly documented. Callers who need atomicity should limit their input to at most 1,000 records, which keeps the operation to a single POST.
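To make the partial-success semantics concrete, here is a minimal standalone sketch. `send_batch` and `create_in_batches` are illustrative names, not the SDK's API; the real implementation would live inside `_create_multiple` and raise an SDK-specific error type:

```python
BATCH_SIZE = 1000

def create_in_batches(records, send_batch, batch_size=BATCH_SIZE):
    """Chunk `records` and POST each chunk via `send_batch` (a stand-in
    for the real per-batch request). Returns all created IDs."""
    created = []
    for i in range(0, len(records), batch_size):
        chunk = records[i:i + batch_size]
        try:
            created.extend(send_batch(chunk))
        except Exception as exc:
            # Partial success: everything in `created` is already
            # committed server-side. Surface that to the caller so a
            # failed bulk import is diagnosable.
            raise RuntimeError(
                f"batch starting at record {i} failed after "
                f"{len(created)} records were created"
            ) from exc
    return created
```

The key design point is the error message: once chunking exists, a mid-run failure must tell the caller how many records already landed, since the SDK can no longer promise all-or-nothing.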

2. Optional concurrent batch dispatch (performance, follow-on)

After batching exists, add an opt-in max_workers parameter to dispatch batches concurrently via concurrent.futures.ThreadPoolExecutor (stdlib, no new dependency).

def create(self, table, data, *, max_workers=1):
    # max_workers=1 (default) = sequential, identical to today
    # max_workers=4 = 4 concurrent batch POSTs

Default must be 1 (sequential) to avoid any regression:

  • No extra threads on slow machines
  • No extra memory overhead
  • No concurrent request spike hitting Dataverse rate limits
  • Identical behavior to today unless user explicitly opts in

When max_workers > 1:

  • Uses ThreadPoolExecutor (~8MB stack per thread, bounded by max_workers)
  • Respects 429 (rate limit) responses — backs off all workers
  • Connection pooling via existing _HttpClient session support
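A minimal sketch of the dispatch logic, assuming batching from section 1 already exists. `dispatch_batches` and `send_batch` are hypothetical names; 429 back-off is deliberately left to the HTTP layer and omitted here:

```python
from concurrent.futures import ThreadPoolExecutor

def dispatch_batches(chunks, send_batch, max_workers=1):
    """POST each chunk via `send_batch`; max_workers=1 (the default)
    is strictly sequential, identical to today's behavior."""
    if max_workers <= 1:
        results = [send_batch(chunk) for chunk in chunks]
    else:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # pool.map preserves input order, so the flattened IDs
            # line up with the caller's record order even though the
            # POSTs complete out of order.
            results = list(pool.map(send_batch, chunks))
    return [record_id for batch in results for record_id in batch]
```

Using `pool.map` rather than `as_completed` keeps the returned IDs in record order, which matters because callers zip the IDs back onto their input records.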

3. Page pre-fetching in _get_multiple (separate enhancement)

_get_multiple (data/_odata.py:821-826) fetches pages sequentially in a while next_link loop. Each page blocks until complete before the next is requested.

Pre-fetching 1 page ahead while the caller processes the current page would overlap I/O with processing:

def _get_multiple(self, ..., prefetch_pages=0):
    # prefetch_pages=0 (default) = sequential, identical to today
    # prefetch_pages=1 = fetch next page while caller processes current

Default must be 0 to avoid buffering extra pages in memory. A single pre-fetched page for a 5,000-record default page size is ~5-20MB depending on column count — acceptable when opted in, but shouldn't be forced.
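A sketch of the overlap, assuming `fetch_page(link)` returns `(records, next_link)` — a simplification of the real `while next_link` loop. With `prefetch_pages=1`, the next page's request is in flight while the caller consumes the current page:

```python
from concurrent.futures import ThreadPoolExecutor

def iter_pages(fetch_page, first_link, prefetch_pages=0):
    """Yield pages of records. prefetch_pages=0 (default) is the
    sequential behavior; prefetch_pages=1 overlaps I/O with the
    caller's processing, buffering at most one extra page."""
    if prefetch_pages <= 0:
        link = first_link
        while link:
            records, link = fetch_page(link)
            yield records
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch_page, first_link)
        while future is not None:
            records, next_link = future.result()
            # Kick off the next fetch before yielding, so it runs
            # while the caller processes `records`.
            future = pool.submit(fetch_page, next_link) if next_link else None
            yield records
```

Bounding the pool at one worker caps the buffered data at a single page, which is what keeps the opt-in memory cost at the ~5-20MB figure above rather than unbounded.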

4. Picklist cache warming (separate enhancement)

_optionset_map (data/_odata.py:1219-1331) makes 2 HTTP calls per string field on cache miss. The cache works well for subsequent records, but the first record with N string fields triggers 2N sequential HTTP calls.

A warm_picklist_cache(table) method that fetches all picklist metadata for a table in a single request would eliminate the cold-start penalty for bulk operations.
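A sketch of the cache-warming shape. `fetch_attributes(table)` stands in for the single metadata request (e.g. one GET against EntityDefinitions expanding picklist attributes — the exact endpoint is an assumption), and the cache key structure is illustrative:

```python
def warm_picklist_cache(table, fetch_attributes, cache):
    """Populate `cache` from one bulk metadata fetch so later
    per-record lookups never miss. `fetch_attributes` is a stand-in
    for a single HTTP request returning all picklist attributes."""
    for attr in fetch_attributes(table):
        # Key by (table, logical name), mirroring what the per-field
        # lookup in _optionset_map would consult, so the first record
        # with N string fields triggers 0 HTTP calls instead of 2N.
        cache[(table, attr["LogicalName"])] = {
            opt["Label"]: opt["Value"] for opt in attr["Options"]
        }
    return cache
```

The win is purely in call count: one request up front replaces 2N sequential cold-start requests, with no change to the existing cache-hit path.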

APIs NOT proposed for parallelism

  • Chunked file upload (_upload.py:117-195) — protocol is sequential by design: it uses a session token with Content-Range headers, and each chunk must return 206 before the next can be sent.
  • Column creation (_odata.py:1712-1762) — Dataverse metadata locks on the same table can cause conflicts with concurrent POSTs.
  • Column deletion (_odata.py:1764-1831) — same metadata-lock concern.
  • Relationship creation (_relationships.py) — same metadata-lock concern.
  • BulkDelete (_odata.py:548-618) — already async server-side; splitting into concurrent jobs adds complexity with minimal benefit.

Context

Identified during end-to-end validation of a 21-table dataset import. The agent-generated script had to implement its own chunking (chunk_size=1000) because the SDK doesn't handle it. Client-side batching should be an SDK responsibility, not something every caller reinvents.
