How to Accelerate SQL Server Data Fetching Using Apache Arrow in mssql-python

Introduction

Fetching a million rows from SQL Server into a Polars DataFrame used to mean creating a million Python objects, triggering countless garbage-collector allocations, and then discarding it all to build the DataFrame. That overhead is now gone. The mssql-python driver, thanks to a community contribution by Felix Graßl (@ffelixg), can fetch SQL Server data directly as Apache Arrow structures. This guide walks you through using Arrow support in mssql-python to make your data pipelines faster and more memory-efficient, whether you work with Polars, Pandas, DuckDB, or any other Arrow-native library.

How to Accelerate SQL Server Data Fetching Using Apache Arrow in mssql-python
Source: devblogs.microsoft.com

What You Need

Step-by-Step Guide

Step 1: Install Required Packages

First, ensure you have the latest mssql-python package with Arrow support. Open your terminal and run:

pip install mssql-python[arrow] polars

This installs both the driver and Polars as an example consumer. For Pandas or DuckDB, adjust accordingly.

Step 2: Import Libraries

Create a Python script and import the necessary modules:

import mssql
import polars as pl
from mssql import connect

The mssql module provides the driver; polars will be used to verify the zero-copy integration.

Step 3: Establish a Connection

Use your SQL Server credentials to create a connection. Replace placeholders with your actual server, database, username, and password:

connection = connect(
    server="your_server.database.windows.net",
    database="your_database",
    username="your_username",
    password="your_password"
)
cursor = connection.cursor()

Step 4: Execute a Query and Fetch as Arrow

Execute a SELECT query. The key new method is fetcharrow(), which returns an Arrow Table directly—no Python objects per row are created:

cursor.execute("SELECT TOP 1000000 * FROM large_table")
arrow_table = cursor.fetcharrow()

This call runs entirely in C++, writing values into contiguous, typed Arrow buffers. Nulls are tracked with a compact bitmap—no None objects per cell.

Step 5: Convert to Your DataFrame Library

Because Arrow uses a stable ABI (the Arrow C Data Interface), conversion is instantaneous—zero-copy:

How to Accelerate SQL Server Data Fetching Using Apache Arrow in mssql-python
Source: devblogs.microsoft.com

No serialization, no copies, no re-parsing—just a pointer exchange.

Step 6: Perform Operations Without Python Overhead

Once data is in an Arrow-native DataFrame, operations like filters, joins, and aggregations also work in-place on the same shared memory buffers. For example, in Polars:

result = df.filter(pl.col("datetime_col") > "2023-01-01").group_by("category").agg(pl.col("value").sum())

No intermediate Python objects are materialized at any stage—this is the foundation for high-throughput pipelines.

Tips and Best Practices

Tags:

Recommended

Discover More

Porsche Abruptly Closes E-Bike, Battery, and Software Units in Major Cost-Cutting OverhaulKubernetes v1.36 Fixes Critical Kubelet API Permission Flaw with New Authorization Feature Now GA5 Essential AWS Updates You Need to Know This Week (April 13, 2026)How to Automate Browser Driver Management in Selenium with WebDriverManagerTerraform Enterprise 2.0: Scaling Infrastructure Management with Advanced Orchestration