Visualize BigQuery data in Jupyter notebooks  |  Google Cloud

A notebook provides an environment in which to author and execute code. A notebook is essentially a source artifact, saved as an IPYNB file. It can contain descriptive text content, executable code blocks, and output rendered as interactive HTML.

Structurally, a notebook is a sequence of cells. A cell is a block of input text that is evaluated to produce results. Cells can be of three types: code cells, which contain executable code; Markdown cells, which contain formatted descriptive text; and raw cells, whose content is passed through without being rendered or executed.

The following image shows a Markdown cell followed by a Python code cell and its output:

Each opened notebook is associated with a running session (also known as a kernel in Python). This session executes all the code in the notebook, and it manages the state. The state includes the variables with their values, functions and classes, and any existing Python modules that you load.

In Google Cloud, you can use a Vertex AI Workbench notebook-based environment to query and explore data, develop and train a model, and run your code as part of a pipeline. In this tutorial, you create a managed notebook instance on Vertex AI Workbench and then explore BigQuery data within the JupyterLab interface.

In this section, you set up a JupyterLab instance on Google Cloud so that you can create managed notebooks.

In this section, you open JupyterLab and explore the BigQuery resources that are available in a managed notebooks instance.

In this section, you write SQL directly in notebook cells and read data from BigQuery into the Python notebook.

Magic commands that use a single or double percentage character (% or %%) let you use minimal syntax to interact with BigQuery within the notebook. The BigQuery client library for Python is automatically installed in a managed notebook. Behind the scenes, the %%bigquery magic command uses the BigQuery client library for Python to run the given query, convert the results to a pandas DataFrame, optionally save the results to a variable, and then display the results.

Note: As of version 1.26.0 of the google-cloud-bigquery Python package, the BigQuery Storage API is used by default to download results from the magics.

To get the number of regions by country in the dataset, enter a query against the international_top_terms table. The output of the query is similar to the output shown below:
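The exact statement was not preserved in this copy of the tutorial; a plausible reconstruction against the bigquery-public-data.google_trends.international_top_terms table, consistent with the output that follows, is:

```
%%bigquery
SELECT
  country_code,
  country_name,
  COUNT(DISTINCT region_code) AS num_regions
FROM
  `bigquery-public-data.google_trends.international_top_terms`
WHERE
  refresh_date = DATE_SUB(CURRENT_DATE, INTERVAL 1 DAY)
GROUP BY
  country_code, country_name
ORDER BY
  num_regions DESC;
```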
Query complete after 0.07s: 100%|██████████| 4/4 [00:00<00:00, 1440.60query/s]
Downloading: 100%|██████████| 41/41 [00:02<00:00, 20.21rows/s]
country_code country_name num_regions
0 TR Turkey 81
1 TH Thailand 77
2 VN Vietnam 63
3 JP Japan 47
4 RO Romania 42
5 NG Nigeria 37
6 IN India 36
7 ID Indonesia 34
8 CO Colombia 33
9 MX Mexico 32
10 BR Brazil 27
11 EG Egypt 27
12 UA Ukraine 27
13 CH Switzerland 26
14 AR Argentina 24
15 FR France 22
16 SE Sweden 21
17 HU Hungary 20
18 IT Italy 20
19 PT Portugal 20
20 NO Norway 19
21 FI Finland 18
22 NZ New Zealand 17
23 PH Philippines 17
...
Note: Your results might differ from those shown above because the dataset being queried is refreshed with new data on an ongoing basis.

In the next cell (below the output from the previous cell), enter the same query again, but this time save the results to a new pandas DataFrame. You provide the DataFrame's name as an argument to the %%bigquery magic command.

Note: For more information about available arguments for the %%bigquery command, see the client library magics documentation.

In the next cell, call the DataFrame's head() method to look at the first few rows of the query results that you just read in. The pandas DataFrame is then ready to plot.
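For example, a pair of cells along the following lines saves the results and previews them. The DataFrame name regions_by_country and the query text are illustrative reconstructions, not taken verbatim from the tutorial:

```
%%bigquery regions_by_country
SELECT
  country_code,
  country_name,
  COUNT(DISTINCT region_code) AS num_regions
FROM
  `bigquery-public-data.google_trends.international_top_terms`
WHERE
  refresh_date = DATE_SUB(CURRENT_DATE, INTERVAL 1 DAY)
GROUP BY
  country_code, country_name
ORDER BY
  num_regions DESC;
```

And in the following cell:

```
regions_by_country.head()
```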

In this section, you use the BigQuery client library for Python directly to read data into the Python notebook.

The client library gives you more control over your queries and lets you use more complex configurations for queries and jobs. The library's integrations with pandas enable you to combine the power of declarative SQL with imperative code (Python) to help you analyze, visualize, and transform your data.

Note: You can use a number of Python data analysis, data wrangling, and visualization libraries alongside the client library. Several of these libraries are built on top of the pandas DataFrame object.

In the next cell, enter Python code to import the BigQuery client library for Python and initialize a client. The BigQuery client is used to send and receive messages from the BigQuery API.

Then enter the following code to retrieve the percentage of daily top terms in the US top_terms table that overlap across time, by number of days apart. The idea is to look at each day's top terms and see what percentage of them overlap with the top terms from the day before, 2 days prior, 3 days prior, and so on (for all pairs of dates over about a month's span).
from google.cloud import bigquery

# Initialize a client; in a managed notebook, this uses the
# environment's default credentials and project.
client = bigquery.Client()

sql = """
WITH
  TopTermsByDate AS (
    SELECT DISTINCT refresh_date AS date, term
    FROM `bigquery-public-data.google_trends.top_terms`
  ),
  DistinctDates AS (
    SELECT DISTINCT date
    FROM TopTermsByDate
  )
SELECT
  DATE_DIFF(Dates2.date, Date1Terms.date, DAY) AS days_apart,
  COUNT(DISTINCT (Dates2.date || Date1Terms.date)) AS num_date_pairs,
  COUNT(Date1Terms.term) AS num_date1_terms,
  SUM(IF(Date2Terms.term IS NOT NULL, 1, 0)) AS overlap_terms,
  SAFE_DIVIDE(
    SUM(IF(Date2Terms.term IS NOT NULL, 1, 0)),
    COUNT(Date1Terms.term)
  ) AS pct_overlap_terms
FROM
  TopTermsByDate AS Date1Terms
CROSS JOIN
  DistinctDates AS Dates2
LEFT JOIN
  TopTermsByDate AS Date2Terms
ON
  Dates2.date = Date2Terms.date
  AND Date1Terms.term = Date2Terms.term
WHERE
  Date1Terms.date <= Dates2.date
GROUP BY
  days_apart
ORDER BY
  days_apart;
"""

pct_overlap_terms_by_days_apart = client.query(sql).to_dataframe()

pct_overlap_terms_by_days_apart.head()
The SQL is encapsulated in a Python string and then passed to the query() method to run the query. The to_dataframe() method waits for the query to finish and downloads the results to a pandas DataFrame by using the BigQuery Storage API. The first few rows of the query results appear below the code cell.

Note: Your results might differ from those shown above because the dataset being queried is refreshed with new data on an ongoing basis.

For more information about using BigQuery client libraries, see the quickstart Using client libraries.

In this section, you use plotting capabilities to visualize the results from the queries that you previously ran in your Jupyter notebook.

In the next cell, enter code that uses the pandas DataFrame.plot() method to create a bar chart that visualizes the results of the query that returns the number of regions by country. The chart is similar to the following:

In the next cell, enter code that uses DataFrame.plot() again to create a scatter plot that visualizes the results from the query for the percentage of overlap in the top search terms by days apart. The chart is similar to the following:

The size of each point reflects the number of date pairs that are that many days apart in the data. For example, there are more pairs that are 1 day apart than 30 days apart, because the top search terms are surfaced daily over about a month's time.
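A sketch of both plots is shown below. The DataFrame name regions_by_country is illustrative (the tutorial's original name is not preserved here), and tiny hard-coded sample rows stand in for the real query results so that the sketch is self-contained:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import pandas as pd

# Stand-ins for the query results; in the notebook, these DataFrames
# come from the queries run earlier in the tutorial.
regions_by_country = pd.DataFrame(
    {"country_name": ["Turkey", "Thailand", "Vietnam"],
     "num_regions": [81, 77, 63]}
)
pct_overlap_terms_by_days_apart = pd.DataFrame(
    {"days_apart": [1, 2, 3],
     "num_date_pairs": [30, 29, 28],
     "pct_overlap_terms": [0.40, 0.30, 0.25]}
)

# Bar chart: number of regions per country.
ax_bar = regions_by_country.plot(
    kind="bar", x="country_name", y="num_regions", figsize=(15, 10)
)

# Scatter plot: overlap percentage by days apart; point size scales with
# the number of date pairs at that distance.
ax_scatter = pct_overlap_terms_by_days_apart.plot(
    kind="scatter", x="days_apart", y="pct_overlap_terms",
    s=pct_overlap_terms_by_days_apart["num_date_pairs"], figsize=(15, 10)
)
```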

For more information about data visualization, see the pandas documentation.

In this section, you use a notebook shortcut to get summary statistics and visualizations for all fields of a BigQuery table.

The BigQuery client library provides a magic command that you can call with a specific table name to provide an overview of the table and detailed statistics on each of the table's columns.

To view your query history as a tab in JupyterLab, perform the following steps:

In this section, you save your notebook, and download it if you want to keep it for future use after you clean up the resources used in this tutorial.
