
How to Accelerate SQL Server Data Fetching Using Apache Arrow in mssql-python

Last updated: 2026-05-20 18:40:05 · Data Science

Introduction

Fetching a million rows from SQL Server into a Polars DataFrame used to mean creating a million Python objects, triggering countless garbage-collector allocations, and then discarding it all to build the DataFrame. That overhead is now gone. The mssql-python driver, thanks to a community contribution by Felix Graßl (@ffelixg), can fetch SQL Server data directly as Apache Arrow structures. This guide walks you through using Arrow support in mssql-python to make your data pipelines faster and more memory-efficient, whether you work with Polars, Pandas, DuckDB, or any other Arrow-native library.

Source: devblogs.microsoft.com

What You Need

  • Python 3.8+ installed on your system.
  • mssql-python driver (version 1.2.0 or later) with Arrow support. Install via pip install mssql-python[arrow].
  • A target library that can consume Arrow data: Polars, Pandas (with ArrowDtype), DuckDB, or similar.
  • SQL Server database with appropriate credentials and network access.

Step-by-Step Guide

Step 1: Install Required Packages

First, ensure you have the latest mssql-python package with Arrow support. Open your terminal and run:

pip install "mssql-python[arrow]" polars

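This installs both the driver and Polars as an example consumer. The quotes keep shells such as zsh from interpreting the square brackets. To use Pandas or DuckDB instead, replace polars with pandas or duckdb in the command.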
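A quick way to confirm the install worked is to check that the modules can be found before running anything against the database. The sketch below uses only the standard library; the helper name missing_packages is made up for illustration, and the module names in the usage line (mssql_python, polars) are assumed import names for the packages installed above.

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the top-level module names that cannot be imported."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in module_names if find_spec(name) is None]


# Assumed import names for the packages installed in this step.
missing = missing_packages(["mssql_python", "polars"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported as missing, rerun the pip command in the same virtual environment your script uses.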

Step 2: Import Libraries

Create a Python script and import the necessary modules:

import polars as pl
from mssql_python import connect

The mssql_python module (the import name of the mssql-python package) provides the driver; polars is the example consumer of the Arrow data.

Step 3: Establish a Connection

Use your SQL Server credentials to create a connection. Replace placeholders with your actual server, database, username, and password:

connection = connect(
    "SERVER=your_server.database.windows.net;"
    "DATABASE=your_database;"
    "UID=your_username;"
    "PWD=your_password;"
    "Encrypt=yes;"
)
cursor = connection.cursor()

Step 4: Execute a Query and Fetch as Arrow

Execute a SELECT query. The key new method is fetcharrow(), which returns an Arrow Table directly—no Python objects per row are created:

cursor.execute("SELECT TOP 1000000 * FROM large_table")
arrow_table = cursor.fetcharrow()

This call runs entirely in C++, writing values into contiguous, typed Arrow buffers. Nulls are tracked with a compact bitmap—no None objects per cell.
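To make the buffer-plus-bitmap layout concrete, here is a pure-Python sketch of how a nullable integer column can be stored in the style of the Arrow columnar format: one contiguous array of values and one bit per row marking validity (least-significant bit first, 1 meaning "not null"). This is only an illustration of the layout; the driver does this work in native code, and the function names are invented for the example.

```python
from array import array


def to_columnar(values):
    """Pack a list of optional ints into (data buffer, validity bitmap)."""
    # Null slots still occupy a position in the data buffer; 0 is a placeholder.
    data = array("q", (0 if v is None else v for v in values))
    bitmap = bytearray((len(values) + 7) // 8)  # one bit per row, rounded up
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # set the bit: row i is valid
    return data, bitmap


def is_valid(bitmap, i):
    """Return True if row i is not null."""
    return bool((bitmap[i // 8] >> (i % 8)) & 1)


data, bitmap = to_columnar([7, None, 42, None, 5])
print(data.tolist())                            # values, with placeholders for nulls
print([is_valid(bitmap, i) for i in range(5)])  # validity per row
```

Five rows need a single bitmap byte here, whereas a row-oriented fetch would allocate a separate Python object for every cell.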

Step 5: Convert to Your DataFrame Library

Because Arrow defines a stable ABI (the Arrow C Data Interface), libraries can hand buffers to one another without copying them, so conversion is typically near-instant:

  • To Polars: df = pl.from_arrow(arrow_table)
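  • To Pandas: df = arrow_table.to_pandas(types_mapper=pd.ArrowDtype) (requires import pandas as pd)
  • To DuckDB: duckdb.sql("SELECT * FROM arrow_table") (requires import duckdb; DuckDB looks up the local variable arrow_table by name)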

No serialization, no copies, no re-parsing—just a pointer exchange.
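The difference between sharing a buffer and copying it can be shown with the standard library alone. The sketch below is an analogy, not the Arrow C Data Interface itself: a memoryview plays the role of the zero-copy consumer, while a second array plays the role of a conversion that copies.

```python
from array import array

buffer = array("q", [1, 2, 3, 4])   # the producer's contiguous int64 buffer

shared = memoryview(buffer)         # zero-copy: a view over the same memory
copied = array("q", buffer)         # a real copy: new memory, same values

buffer[0] = 99                      # mutate the original buffer

print(shared[0])   # the view sees the change because it shares the memory
print(copied[0])   # the copy still holds the old value
print(shared.nbytes)
```

A zero-copy hand-off costs the same regardless of how many rows the buffer holds, which is why the conversion step does not grow with table size.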

Step 6: Perform Operations Without Python Overhead

Once data is in an Arrow-native DataFrame, operations like filters, joins, and aggregations run in native code directly on the columnar buffers, without converting values to Python objects. For example, in Polars:

result = df.filter(pl.col("datetime_col") > pl.datetime(2023, 1, 1)).group_by("category").agg(pl.col("value").sum())

No intermediate Python objects are materialized at any stage—this is the foundation for high-throughput pipelines.

Tips and Best Practices

  • Speed gains are most noticeable with temporal types (like DATETIME and DATETIMEOFFSET) because per-value Python-side conversions are eliminated entirely.
  • Memory usage drops dramatically: A column of one million integers becomes a single contiguous C array, not a million individual Python objects. Great for large datasets.
  • Seamless interoperability with other Arrow-native tools—Polars, Pandas (with ArrowDtype), DuckDB, and even tools like Hugging Face datasets—means you can mix and match libraries without conversion costs.
  • Ensure you're using a recent version of mssql-python (1.2.0+) to have the fetcharrow() method available.
  • For best results, avoid fetching rows individually—always use bulk fetch methods like fetcharrow() when working with Arrow.
  • Monitor your garbage collector; with Arrow, you'll see far less GC pressure, making your overall application more predictable.
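The memory claim in the tips above is easy to check with the standard library. This sketch compares a list of Python int objects with a contiguous 64-bit buffer holding the same values; array.array stands in for an Arrow buffer here, and the exact list figure varies by Python version.

```python
import sys
from array import array

n = 100_000

as_objects = list(range(n))         # one Python int object per value
as_buffer = array("q", range(n))    # one contiguous block of 8-byte integers

# The list stores pointers; each int object is accounted for separately.
list_bytes = sys.getsizeof(as_objects) + sum(sys.getsizeof(v) for v in as_objects)
buffer_bytes = len(as_buffer) * as_buffer.itemsize

print(f"list of ints:      {list_bytes:,} bytes")
print(f"contiguous buffer: {buffer_bytes:,} bytes")
```

The buffer also creates no per-value objects for the garbage collector to track, which is where the reduced GC pressure comes from.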