Checklist for Improving the Superset Integration for a Database

In an earlier post, Building Database Connectors for Superset Using SQLAlchemy, I wrote about how Superset talks to SQL databases using:
a SQLAlchemy dialect + a companion Python DB-API 2 library to enable basic functionality
some database-specific engine code to enable all features (called db_engine_spec in Superset)
Using CrateDB as a reference point, I showcased the mandatory and highly recommended properties and methods that need to be implemented in the custom database engine.
In this post, I'll walk through the full checklist covering the rest of the properties and methods you can implement in a database engine to improve the experience in Superset!
Mandatory
To quickly recap:
To generate queries, Superset has an API layer that lives in  superset/db_engine_specs . The base API that's common for all databases is exposed in  superset/db_engine_specs/base.py .
To better support the nuances of each database, individual database engines can be extended using  BaseEngineSpec . Each database engine should live in a separate file (e.g. bigquery.py for Google BigQuery and clickhouse.py for ClickHouse).
At a minimum, a custom database engine needs to have the following attributes set so the database is shown in the Add Database modal (if the database driver is installed):
engine
engine_name
Here’s a screenshot of this in action from the CrateDB database engine:
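To make the shape concrete, here is a minimal sketch of such a class. The names ExampleDbEngineSpec, "exampledb", and "ExampleDB" are hypothetical; in a real engine spec the class subclasses Superset's BaseEngineSpec, which is omitted here so the snippet stands alone:

```python
# Minimal sketch of a custom db_engine_spec.
# In Superset this would be: class ExampleDbEngineSpec(BaseEngineSpec): ...
class ExampleDbEngineSpec:
    engine = "exampledb"       # name of the SQLAlchemy dialect (hypothetical)
    engine_name = "ExampleDB"  # display name in the Add Database modal (hypothetical)
```

With just these two attributes set (and the driver installed), the database shows up in the Add Database modal.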
Highly Recommended
Next up is a set of properties and functions that are core to the Superset experience and are highly recommended to implement.
Time Grain Expressions
Superset’s roots are in time series visualization, where slicing time by specific time grains is a critical feature.
If you want people to create time-series charts and choose the granularity of the time bucketing in Explore, then you’ll need to define the _time_grain_expressions dictionary as a class attribute.
class attribute: _time_grain_expressions
Dictionary that defines the mapping from ISO-8601 durations to SQL expression that truncate a temporal value to the specified time grain.
The following is a reference from the BaseEngineSpec class:
builtin_time_grains: Dict[Optional[str], str] = {
    None: "Original value",
    "PT1S": "Second",
    "PT1M": "Minute",
    "PT1H": "Hour",
    "P1D": "Day",
    "P1W": "Week",
    "P1M": "Month",
    "P3M": "Quarter",
    "P1Y": "Year",
    # ... (additional grains omitted)
}
Here’s an example implementation from CrateDB that implements a subset of these:
_time_grain_expressions = {
    None: "{col}",
    "PT1S": "DATE_TRUNC('second', {col})",
    "PT1M": "DATE_TRUNC('minute', {col})",
    "PT1H": "DATE_TRUNC('hour', {col})",
    "P1D": "DATE_TRUNC('day', {col})",
    "P1W": "DATE_TRUNC('week', {col})",
    "P1M": "DATE_TRUNC('month', {col})",
    "P3M": "DATE_TRUNC('quarter', {col})",
    "P1Y": "DATE_TRUNC('year', {col})",
}
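At query time, Superset fills the "{col}" placeholder in the chosen grain's expression with the temporal column the user selected. A quick standalone sketch of that substitution (the column name order_ts is hypothetical):

```python
# One entry from a _time_grain_expressions-style mapping.
_time_grain_expressions = {
    "PT1H": "DATE_TRUNC('hour', {col})",
}

# Superset substitutes the selected temporal column into the template.
sql = _time_grain_expressions["PT1H"].format(col="order_ts")
print(sql)  # DATE_TRUNC('hour', order_ts)
```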
Convert Python DateTime to SQL Expression
Superset’s backend is written in Python and this function is needed to help bridge the Python-SQL conversion.
class method: convert_dttm() defines how to convert a Python date or datetime object to a SQL expression
Here’s an example implementation from CrateDB:
@classmethod
def convert_dttm(
    cls, target_type: str, dttm: datetime, db_extra: Optional[Dict[str, Any]] = None
) -> Optional[str]:
    tt = target_type.upper()
    if tt == utils.TemporalType.TIMESTAMP:
        return f"{dttm.timestamp() * 1000}"
    return None
Convert Epoch Value to DateTime
SQL expression that converts an epoch Unix time (seconds since 1970-01-01) to a datetime type.
class method: epoch_to_dttm()
Here’s an example implementation from CrateDB:
@classmethod
def epoch_to_dttm(cls) -> str:
    return "{col} * 1000"
Customize SQLAlchemy String
A user registers a new database to connect to in Superset either by manually typing the SQLAlchemy connection string or by filling out a form in the Add Database screen that builds the string on the user’s behalf.
If the SQLAlchemy connection string needs to be customized somehow (e.g. for connecting to a specific schema in the database), then the following function is the right place to define this logic.
class method: adjust_database_uri
Here’s an example implementation from Presto that mutates the connection string and appends / followed by the selected_schema value:
@classmethod
def adjust_database_uri(
    cls, uri: URL, selected_schema: Optional[str] = None
) -> None:
    database = uri.database
    if selected_schema and database:
        selected_schema = parse.quote(selected_schema, safe="")
        if "/" in database:
            database = database.split("/")[0] + "/" + selected_schema
        else:
            database += "/" + selected_schema
        uri.database = database
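The schema-appending logic above can be exercised on its own. Here is a hedged standalone sketch of just that string manipulation (the helper name append_schema and the database names are hypothetical, and the URL-object handling from the real method is omitted):

```python
from urllib.parse import quote

def append_schema(database: str, selected_schema: str) -> str:
    """Sketch of the Presto-style schema-appending logic (hypothetical helper)."""
    selected_schema = quote(selected_schema, safe="")
    if "/" in database:
        # replace any previously selected schema
        return database.split("/")[0] + "/" + selected_schema
    return database + "/" + selected_schema

print(append_schema("hive", "default"))           # hive/default
print(append_schema("hive/old_schema", "sales"))  # hive/sales
```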
Recommended
Next up in the checklist is a set of recommended methods & properties.
Better Database Exceptions
Superset talks to databases to run queries using the Python DB-API 2.0 driver for that specific database. Sometimes, the DB-API driver returns an error / exception (e.g. if the query contains syntax errors). Exceptions from the Python DB-API driver can be mapped to Superset exceptions to provide better feedback to end-users in Superset.
class method: get_dbapi_exception_mapping()
Here’s an example from the ElasticSearch database engine :
@classmethod
def get_dbapi_exception_mapping(cls) -> Dict[Type[Exception], Type[Exception]]:
    # pylint: disable=import-error,import-outside-toplevel
    import es.exceptions as es_exceptions

    return {
        es_exceptions.DatabaseError: SupersetDBAPIDatabaseError,
        es_exceptions.OperationalError: SupersetDBAPIOperationalError,
        es_exceptions.ProgrammingError: SupersetDBAPIProgrammingError,
    }
The dictionary above tells Superset how to translate exceptions from the elasticsearch driver into Superset exceptions. Also notice the exception classes imported from Superset at the top of the database engine:
from superset.db_engine_specs.exceptions import (
    SupersetDBAPIDatabaseError,
    SupersetDBAPIOperationalError,
    SupersetDBAPIProgrammingError,
)
Column Type Mappings
Many databases implement custom column data types that need to be mapped to both SQLAlchemy types and Superset's GenericDataType. Note that the BaseEngineSpec that all database engines inherit from covers most typical ANSI SQL column types like CHAR, VARCHAR, INT, etc., so only non-standard types need to be handled in the database engine. At the time of writing, only three database engines in Superset implement column type mappings: MySQL, Postgres, and Presto.
attribute: column_type_mappings: Tuple[ColumnTypeMapping, ...]
Let’s look at a simple example from PostgresEngineSpec:
column_type_mappings = (
    (
        re.compile(r"^double precision", re.IGNORECASE),
        DOUBLE_PRECISION(),
        GenericDataType.NUMERIC,
    ),
)
Here’s a breakdown of each tuple value:
re.compile(r"^double precision", re.IGNORECASE): a regular expression that matches against text like “double precision”
DOUBLE_PRECISION(): the double precision type from Postgres SQLAlchemy
GenericDataType.NUMERIC: the simplified Superset data type we want it mapped to
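The first tuple element is an ordinary compiled regex, so it is easy to check which raw type strings an entry would claim. A small standalone sketch of just the matching step (the type strings tested here are illustrative):

```python
import re

# The pattern from the Postgres entry above; matching is case-insensitive
# and anchored at the start of the raw column type string.
pattern = re.compile(r"^double precision", re.IGNORECASE)

print(bool(pattern.match("DOUBLE PRECISION")))  # True
print(bool(pattern.match("numeric")))           # False
```

In the real tuple, a match routes the column to the paired SQLAlchemy type and GenericDataType.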
Superset has a few GenericDataType values that all column types map to, which simplifies things for visualization:
NUMERIC
STRING
TEMPORAL
BOOLEAN
This can be found in the class definition for GenericDataType .
Here’s a preview of the base template from BaseEngineSpec, which maps a lot of the standard column types:
column_type_mappings: Tuple[ColumnTypeMapping, ...] = (
    (
        re.compile(r"^string", re.IGNORECASE),
        types.String(),
        GenericDataType.STRING,
    ),
    (
        re.compile(r"^n((var)?char|text)", re.IGNORECASE),
        types.UnicodeText(),
        GenericDataType.STRING,
    ),
    (
        re.compile(r"^(var)?char", re.IGNORECASE),
        types.String(),
        GenericDataType.STRING,
    ),
    # ...
)
From the 3 tuples above, we can see that column types that look like “string”, “nchar”, “varchar”, and “text” will be mapped to different SQLAlchemy types but the same Superset GenericDataType (STRING!).
Convert Epoch Value (in Milliseconds) to DateTime
SQL expression that converts an epoch Unix time (seconds since 1970-01-01) in milliseconds to a datetime type. Note the difference between this and epoch_to_dttm from the earlier section in this post!
class method: epoch_ms_to_dttm()
Here’s an example implementation from CrateDB:
@classmethod
def epoch_ms_to_dttm(cls) -> str:
    return "{col}"
Max Column Name Length
Databases often have different limits on how long a column name or alias can be. When a column name or alias exceeds this value, it’ll be replaced at query time by a truncated MD5 hash representation of the full, untruncated column name or alias, to ensure that the query can still be executed.
attribute: max_column_name_length (integer value)
In the following screenshot, you can see the diversity of column name lengths that different databases support.
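The hash-based replacement keeps over-long labels deterministic and unique. A hedged sketch of the idea (the helper name truncate_label and the limit of 30 are hypothetical, for illustration only):

```python
import hashlib

MAX_COLUMN_NAME_LENGTH = 30  # hypothetical per-database limit

def truncate_label(label: str, max_len: int = MAX_COLUMN_NAME_LENGTH) -> str:
    """Replace an over-long column label with a truncated MD5 hash of it."""
    if len(label) <= max_len:
        return label
    # hashing the full label keeps the replacement deterministic
    return hashlib.md5(label.encode("utf-8")).hexdigest()[:max_len]

print(truncate_label("revenue"))        # short labels pass through unchanged
print(truncate_label("a" * 80))         # long labels become a 30-char hash prefix
```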
Optional (Nice to Have)
Lastly, you can specify a list of function names to aid in auto-completion in SQL Lab (Superset’s SQL Editor):
class method: get_function_names()
Here’s how ClickHouse implements this function in ClickhouseEngineSpec:
@classmethod
@cache_manager.cache.memoize()
def get_function_names(cls, database: "Database") -> List[str]:
    system_functions_sql = "SELECT name FROM system.functions"
    try:
        df = database.get_df(system_functions_sql)
        if cls._show_functions_column in df:
            return df[cls._show_functions_column].tolist()
        columns = df.columns.values.tolist()
        logger.error(
            ...
Interestingly, ClickHouse has a way to return the list of functions using SQL and that feature is used here to return the list of available functions for auto-completion.
Checklist
Here’s a shortened, glanceable checklist to use when building a custom database engine:
Mandatory
attribute: engine : name of the SQLAlchemy dialect
attribute: engine_name : name that should be displayed in the Superset UI
Highly Recommended
attribute: _time_grain_expressions : mapping from ISO-8601 durations to SQL expressions that truncate a temporal value to the specified time grain.
class method: convert_dttm : convert a Python date/datetime object to a SQL expression
class method: epoch_to_dttm : SQL expression that converts an epoch value to a datetime type
class method: adjust_database_uri : function to customize the SQLAlchemy connection string if needed (e.g. appending the chosen schema to the connection string)
Recommended
class method: get_dbapi_exception_mapping() : mapping from database driver specific exceptions to Superset exceptions to showcase the correct feedback to the end-user.
attribute: column_type_mappings : tuple of values to map non-standard column types to both the SQLAlchemy type and the Superset GenericDataType
class method: epoch_ms_to_dttm : SQL expression that converts an epoch value (in milliseconds) to a datetime type
attribute: max_column_name_length: max length of a column name / alias that the database supports
Optional (Nice to Have)
class method: get_function_names() : list of function names to aid auto-completion in SQL Lab
