SQLite vs Pandas: Performance Benchmarks

SQLite vs Pandas: Performance Benchmarks

Previously posted on May 23, 2018, on The Data Incubator

This technical article was written for The Data Incubator by Paul Paczuski, a Fellow of our 2016 Spring cohort in New York City who landed a job with our hiring partner, Genentech as a Clinical Data Scientist.

As a data scientist, we all know that unglamorous data manipulation is 90% of the work. Two of the most common data manipulation tools are SQL and pandas. In this blog, we’ll compare the performance of pandas and SQLite, a simple form of SQL favored by Data Scientists.

Let’s find out the tasks at which each of these excel. Below, we compare Python’s pandas to sqlite for some common data analysis operations: sort, select, load, join, filter, and group by.

Note that the axis is logarithmic, so that raw differences are more pronounced.

For the analysis, we ran the six tasks 10 times each, for 5 different sample sizes, for each of 3 programs: pandas, sqlite, and memory-sqlite (where database is in memory instead of on disk). See below for the definitions of each task.

Our sample data was randomly generated. Here’s what it looks like:

This consists of a random string of 8 characters, a random single character (for the filtering operation), a random integer simulating a year (1900-2000), and a uniform random float value between 10000 and 100000.

sqlite or memory-sqlite is faster for the following tasks:

pandas is faster for the following tasks:

Comparing memory-sqlite vs. sqlite, there was no meaningful difference, especially as data size increased.

There is no significant speedup from loading sqlite in its own shell vs. via pandas.

Overall, joining and loading data is the slowest whereas select and filter are generally the fastest. Further, pandas seems to be optimized for group-by operations, where it performs really well (group-by is pandas‘ second-fastest operation for larger data).

Note that this analysis assumes you are equally proficient in writing code with both! But these results could encourage you to learn the tool that you are less familiar with, if the performance gains are significant.

All code is on our GitHub page.

Below are the definitions of our six tasks: sort, select, load, join, filter, and group by (see driver/sqlite_driver.py or driver/pandas_driver.py).

sqlite is first, followed by pandas:

self._cursor.execute(‘SELECT * FROM employee ORDER BY name ASC;’)

‘FROM employee INNER JOIN bonus ON employee.name = bonus.name’)

We used pandas version 0.19.1 and sqlite version 3.13.0. All tests were run on Digital Ocean Ubuntu 14.04 with 16GB memory and 8 core CPU.

For resources on becoming a data scientist, check out our blog, particularly this article on preparing for our free data science fellowship and this one ondata manipulating data like a professional data scientist. And if you’re looking for a class, consider ourconvenient after-work online instructor lead data science foundations course geared towards working professionals or ourfree data science fellowship for people with advanced degrees.

To learn more how sqlite works, check out this awesome blog series.

Here is a syntax comparison between pandas and sql.

Images Powered by Shutterstock