In-Memory Analytics with Apache Arrow
By Matthew Topol
About this book
Apache Arrow is designed to accelerate analytics and to make it easy to exchange data across big data systems.
In-Memory Analytics with Apache Arrow begins with a quick overview of the Apache Arrow format, before moving on to helping you to understand Arrow’s versatility and benefits as you walk through a variety of real-world use cases. You'll cover key tasks such as enhancing data science workflows with Arrow, using Arrow and Apache Parquet with Apache Spark and Jupyter for better performance and hassle-free data translation, as well as working with Perspective, an open source interactive graphical and tabular analysis tool for browsers. As you advance, you'll explore the different data interchange and storage formats and become well-versed with the relationships between Arrow, Parquet, Feather, Protobuf, Flatbuffers, JSON, and CSV. In addition to understanding the basic structure of the Arrow Flight and Flight SQL protocols, you'll learn about Dremio’s usage of Apache Arrow to enhance SQL analytics and discover how Arrow can be used in web-based browser apps. Finally, you'll get to grips with the upcoming features of Arrow to help you stay ahead of the curve.
By the end of this book, you will have all the building blocks to create useful, efficient, and powerful analytical services and utilities with Apache Arrow.
Chapter 1: Getting Started with Apache Arrow
Regardless of whether you are a data scientist/engineer, a machine learning (ML) specialist, or a software engineer trying to build something to perform data analytics, you've probably heard or read about something called Apache Arrow and either looked for more information or wondered what it was. Hopefully, this book can serve as a springboard both in understanding what Apache Arrow is and isn't, and also as a reference book to be continuously utilized in order to supercharge your analytical capabilities.
For now, let's just start off by explaining what Apache Arrow is and what you will use it for. Following that, we will walk through the Arrow specifications, set up a development environment where you can play around with the Apache Arrow libraries, and walk through a few simple exercises to get a feel for how to use them.
In this chapter, we're going to cover the following topics:
Understanding the Arrow format and specifications
Why does Arrow use a columnar in-memory format?
Learning the terminology and the physical memory layout
Arrow format versioning and stability
Setting up your shooting range
For the portion of the chapter describing how to set up a development environment for working with the Arrow libraries, you'll need the following:
Your preferred Integrated Development Environment (IDE): For example, VSCode, Sublime, Emacs, and Vim
Plugins for your desired language (optional but highly recommended)
Interpreter or toolchain for your desired language(s):
Python 3+: pip and venv and/or pipenv
C++ Compiler (capable of compiling C++11 or newer)
Understanding the Arrow format and specifications
According to the Apache Arrow documentation:
Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.
Well, that's a lot of technical jargon! Let's start from the top. Apache Arrow (just Arrow for brevity) is an open source project from the Apache Software Foundation that is released under the Apache License, Version 2.0. It was co-created by Dremio and Wes McKinney, the creator of pandas, and first released in 2016. To simplify, Arrow is a collection of libraries and specifications that make it easy to build high-performance software utilities for processing and transporting large datasets. It consists of a collection of libraries related to in-memory data processing, including specifications for memory layouts and protocols for sharing and efficiently transporting data between systems and processes. When we're talking about in-memory data processing, we are talking exclusively about the processing of data in RAM, eliminating slow data accesses wherever possible to improve performance. This is where Arrow excels, providing libraries and utilities for streaming and transporting data in order to speed up access.
When working with data, there are two primary situations to consider, and each has different needs: the in-memory format and the on-disk format. When data is stored on disk, the biggest concerns are the size of the data and the input/output (I/O) cost to read it into main memory before you can operate on it. As a result, formats for data on disk tend to focus much more on increasing I/O throughput, such as compressing the data to make it smaller and faster to read into memory. One example of this is the Apache Parquet format, which is a columnar on-disk file format. Arrow's focus, by contrast, is the in-memory format, which targets CPU efficiency through numerous tactics such as cache locality and the vectorization of computation.
The primary goal of Arrow is to essentially become the lingua franca of data analytics and processing, the One Framework to Rule Them All, so to speak. Different databases, programming languages, and libraries tend to implement and use their own separate internal formats for managing data, which means that any time you are moving data between these components for different uses, you're paying a cost to serialize and deserialize that data every time. Not only that, but lots of time and resources get spent reimplementing common algorithms and processing in those different data formats over and over. If instead, we can standardize on an efficient, feature-rich internal data format that can be widely adopted and used, this excess computation and development time is no longer necessary. Figure 1.1 shows a simplified diagram of multiple systems, each with its own data format, having to be copied and/or converted in order for the different components to work with each other:
Figure 1.1 – Copy and convert components
In many cases, the serialization and deserialization can end up taking nearly 90% of the processing time in such a system rather than being able to spend that CPU on the analytics. Alternatively, if every component is using Arrow's in-memory format, you end up with a system as in Figure 1.2, where the data can be transferred between components at little-to-no cost. All of the components can either share memory directly or send the data as-is without having to convert between different formats.
Figure 1.2 – Sharing Arrow memory between components
At this point, there's no need for the different components and systems to implement custom connectors or re-implement common algorithms and utilities. The same libraries and connectors can be utilized, even across programming languages and process barriers, by sharing memory directly to refer to the same data rather than copying multiple times between them.
Most data processing systems now use distributed processing by breaking the data up into chunks and sending those chunks across the network to various workers, so even if we can share memory across processes on a box, there's still the cost of sending it across the network. This brings us to the final piece of the puzzle: the format of raw Arrow data on the wire is the same as it is in memory. You can directly reference the memory buffers used for the network protocols without having to deserialize the data before using it, and you can send the memory buffers you were operating on across the network without having to serialize them first. Only a small amount of metadata needs to be sent along with the raw data buffers, and interfaces that perform zero-copies can be built on top of this to reduce memory usage and improve CPU throughput.
Let's quickly recap the features of the Arrow format we were just describing before moving on:
Using the same high-performance internal format across components allows much more code reuse in libraries instead of reimplementing common workflows.
The Arrow libraries provide mechanisms to directly share memory buffers to reduce copying between processes by using the same internal representation regardless of the language. This is what is being referred to whenever you see the term zero-copy.
The wire format is the same as the in-memory format, eliminating serialization and deserialization costs when sending data across networks between components of a system (a brief sketch of this follows the list).
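To make that last point concrete, here's a minimal sketch using the pyarrow library (which we'll set up later in this chapter) that round-trips a record batch through the Arrow IPC stream format entirely in memory; the column names and values are invented for illustration:

import pyarrow as pa

# Build a small record batch to send "across the wire".
batch = pa.RecordBatch.from_pydict({
    "archer": ["legolas", "oliver", "merida"],
    "year": [1954, 1941, 2012],
})

# Write it to an in-memory sink using the Arrow IPC stream format.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
buf = sink.getvalue()

# "Receiving" the stream wraps the same bytes; the batch's buffers
# are referenced in place rather than deserialized into new structures.
reader = pa.ipc.open_stream(buf)
received = reader.read_next_batch()
print(received.equals(batch))  # True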
Now, you might be thinking, "Well, this sounds too good to be true!" and of course, being skeptical of promises like this is always a good idea. The community around Arrow has done a ton of work over the years to bring these ideas and concepts to fruition. The project itself provides and distributes libraries in a variety of different programming languages so that projects that want to incorporate and/or support the Arrow format don't need to implement it themselves. Above and beyond the interaction with Arrow-formatted data, the libraries provide a significant amount of utility in assisting with common processes such as data access and I/O-related optimizations. As a result, the Arrow libraries can be useful for projects, even if they don't actually utilize the Arrow format themselves.
Here's just a quick sample of use cases where using Arrow as the internal/intermediate data format can be very beneficial:
SQL execution engines (such as Dremio Sonar, Apache Drill, or Impala)
Data analysis utilities and pipelines (such as pandas or Spark)
Streaming and message queue systems (such as Apache Kafka or Storm)
Storage systems and formats (such as Apache Parquet, Cassandra, and Kudu)
As for how Arrow can help you, it depends on which piece of the data puzzle you personally work with. The following are a few different roles that work with data and show how using Arrow could potentially be beneficial; it's by no means a complete list though:
If you're a data scientist:
You can utilize Arrow via pandas and NumPy integration to significantly improve the performance of your data manipulations.
If the tools you use integrate Arrow support, you can gain significant speed-ups to your queries and computations by using Arrow directly yourself to reduce copies and/or serialization costs.
If you are a data engineer specializing in extract, transform, and load (ETL):
The higher adoption of Arrow as an internal and externally-facing format can make it easier to integrate with many different utilities.
By using Arrow, data can be shared between processes and tools with shared memory increasing the tools available to you for building pipelines, regardless of the language you're operating in. You could take data from Python and use it in Spark and then pass it directly to the Java Virtual Machine (JVM) without paying the cost of copying between them.
If you are a software engineer or ML specialist building computation tools and utilities for data analysis:
Arrow as an internal format can be used to improve your memory usage and performance by reducing serialization and deserialization between components.
Understanding how to best utilize the data transfer protocols can improve the ability to parallelize queries and access your data, wherever it might be.
Because Arrow can be used for any sort of tabular data, it can be integrated into many different areas of data analysis and computation pipelines, and is versatile enough to be beneficial as an internal and data transfer format, regardless of the shape of your data.
Now that you know what Arrow is, let's dig into its design and how it delivers on the aforementioned promises of high-performance analytics, zero-copy sharing, and network communication without serialization costs. First, you'll see why a column-oriented memory representation was chosen for Arrow's internal format. Afterward, in later chapters, we'll cover specific integration points, explicit examples, and transfer protocols.
Why does Arrow use a columnar in-memory format?
Most traditional systems that process tabular data, such as query engines and data services, have their own custom data structures for representing and managing those datasets in memory while processing them. Of course, having custom data structures means developing custom serialization protocols between file formats, network protocols, libraries, and any other interface you could think of. I can vouch from experience that the result is a huge amount of developer time and CPU cycles being wasted dealing with these various serialization schemes, rather than being able to spend it all on the analytical workloads. One goal of the Arrow project is for fewer systems to have to create their own data structures, utilizing Arrow as their internal format instead. Doing so would allow those components to expose Arrow directly as a wire format and benefit from not having to pay a serialization or deserialization cost to pass the data around.
There is often a lot of debate surrounding whether a database should be row-oriented or column-oriented, but this primarily refers to the on-disk format of the underlying storage files. Arrow's data format is different from most cases discussed so far since it uses a columnar organization of data structures in memory directly. If you're not familiar with columnar as a term, let's take a look at what exactly it means. First, imagine the following table of data:
Figure 1.3 – Sample data table
Traditionally, if you were to read this table into memory, you'd likely have some structure to represent a row and then read the data in one row at a time. Maybe something like the row type sketched in the following snippet. The result is that you have the memory grouped closely together for each row, which is great if you always want to read all the columns for every row. But, if this were a much bigger table, and you just wanted to find out the minimum and maximum years or any other column-wise analytics such as the unique locations, you would have to read the whole table into memory and then jump around in memory, skipping the fields you didn't care about, so that you could read the value for each row of one column.
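Here's a rough Python sketch of that row-oriented approach; the field names mirror the columns of the sample table, and the values shown are invented for illustration:

from dataclasses import dataclass

@dataclass
class DataRow:
    archer: str    # one field per column of the table
    location: str
    year: int

# Row-oriented: each row's fields sit together in memory, so any
# column-wise analytic has to touch every row object.
rows = [
    DataRow("legolas", "murkwood", 1954),
    DataRow("oliver", "star city", 1941),
]
min_year = min(row.year for row in rows)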
Most operating systems, while reading data into main memory and CPU caches, will attempt to predict what memory is going to be needed next. In our example table of archers, consider how many memory pages of data would have to be accessed and traversed to get a list of unique locations if the data were organized in row or column orientation:
Figure 1.4 – Row versus columnar memory buffers
A columnar format keeps the data organized by column instead of by row, as shown in the preceding figure. As a result, operations such as grouping, filtering, or aggregations based on column values become much more efficient to perform since the entire column is already contiguous in memory. Considering memory pages again, it's plain to see that for a large table, there would be significantly more pages that need to be traversed to get a list of unique locations from a row-oriented buffer than a columnar one. Fewer page faults and more cache hits mean increased performance and a happier CPU. Computational routines and query engines tend to operate on subsets of the columns for a dataset rather than needing every column for a given computation, making it significantly more efficient to operate on columnar data.
If you look closely at the construction of the column-oriented data buffer on the right side of Figure 1.4, you can see how it benefits the queries I mentioned earlier. If we wanted all the archers that are in Europe, we can easily scan through just the location column and discover which rows are the ones we want, and then spin through just the archer block and grab only the rows that correspond to the row indexes we found. This will come into play again when we start looking at the physical memory layout of Arrow arrays; since the data is column-oriented, it makes it easier for the CPU to predict instructions to execute and maintains this memory locality between instructions.
By keeping the column data contiguous in memory, it enables vectorization of the computations. Most modern processors have single instruction, multiple data (SIMD) instructions available that can be taken advantage of for speeding up computations and require having the data in a contiguous block of memory to operate on it. This concept can be found heavily utilized by graphics cards, and in fact, Arrow provides libraries to take advantage of Graphics Processing Units (GPUs) precisely because of this. Consider the example where you might want to multiply every element of a list by a static value, such as performing a currency conversion on a column of prices with an exchange rate:
Figure 1.5 – SIMD/vectorized versus non-vectorized
From the figure, you can see the following:
The left side of the figure shows that an ordinary CPU performing the computation in a non-vectorized fashion requires loading each value into a register, multiplying it with the exchange rate, and then saving the result back into RAM.
On the right side of the figure, we see that vectorized computation, such as using SIMD, performs the same operation on multiple different inputs at the same time, enabling a single load, multiply, and save to get the result for the entire group of prices. Being able to vectorize a computation has various constraints; often, one of those constraints is requiring the data being operated on to be in a contiguous chunk of memory, which is why columnar data is much easier to do this with (a rough sketch of this difference follows the list).
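As a rough illustration, here's a small sketch using NumPy (not part of Arrow, but a familiar stand-in for operating on contiguous buffers); the prices and exchange rate are made up:

import numpy as np

prices = np.array([10.0, 25.5, 99.99, 4.75], dtype=np.float64)
exchange_rate = 1.08

# Non-vectorized: load, multiply, and store one element at a time.
converted_loop = np.empty_like(prices)
for i in range(len(prices)):
    converted_loop[i] = prices[i] * exchange_rate

# Vectorized: one expression over the contiguous buffer, which NumPy
# can execute using SIMD instructions under the hood.
converted = prices * exchange_rate
print(np.array_equal(converted, converted_loop))  # True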
SIMD versus Multithreading
If you're not familiar with SIMD, you may wonder how it differs from another parallelization technique: multithreading. Multithreading operates at a higher conceptual level than SIMD. Each thread has its own set of registers and memory space representing its execution context. These contexts could be spread across separate CPU cores or possibly interleaved by a single CPU core switching whenever it needs to wait for I/O. SIMD is a processor-level concept that refers to the specific instructions being executed. Put simply, multithreading is multitasking and SIMD is doing less work to achieve the same result.
Another benefit of utilizing column-oriented data comes into play when considering compression techniques. At some point, your data will become large enough that sending it across the network could become a bottleneck, purely due to size and bandwidth. With the data grouped into columns of the same type in contiguous memory, we end up with significantly better compression ratios than we would get with the same data in a row-oriented configuration, simply because data of the same type is easier to compress together than data of different types.
Learning the terminology and physical memory layout
As mentioned before, the Arrow columnar format specification includes definitions of the in-memory data structures, metadata serialization, and protocols for data transportation. The format itself has a few key promises, as follows:
Data adjacency for sequential access
O(1) (constant time) random access
SIMD and vectorization friendly
Relocatable, allowing for zero-copy access in shared-memory
To ensure we're all on the same page, here's a quick glossary of terms that are used throughout the format specification and the rest of the book:
Array: A list of values with a known length of the same type.
Slot: The value in an array identified by a specific index.
Buffer/contiguous memory region: A single contiguous block of memory with a given length.
Physical layout: The underlying memory layout for an array without accounting for the interpretation of the logical value. For example, a 32-bit signed integer array and a 32-bit floating-point array are both laid out as contiguous chunks of memory where each value is made up of four contiguous bytes in the buffer.
Parent/child arrays: Terms used for the relationship between physical arrays when describing the structure of a nested type. For example, a struct parent array has a child array for each of its fields.
Primitive type: A type that has no child types and so consists of a single array, such as fixed-bit-width arrays (for example, int32) or variable-size types (for example, string arrays).
Nested type: A type that depends on one or more other child types. Two nested types are only equal if their child types are also equal (for example, List<T> and List<U> are equal if and only if T and U are equal).
Logical type: A particular way of interpreting the values in an array, implemented using a specific physical layout. For example, the decimal logical type stores values as 16 bytes per value in a fixed-size binary layout. Similarly, a timestamp logical type stores values using a 64-bit fixed-size layout.
Now that we've got the fancy words out of the way, let's have a look at how we actually lay out these arrays in memory. An array or vector is defined by the following information:
A logical data type (typically identified by an enum value and metadata)
A group of buffers
A length as a 64-bit signed integer
A null count as a 64-bit signed integer
Optionally, a dictionary for dictionary-encoded arrays (more on these later in the chapter)
To define a nested array type, there would additionally be one or more sets of this information that would then be the child arrays. Arrow defines a series of logical types, and each one has a well-defined physical layout in the specification. For the most part, the physical layout just affects the sequence of buffers that make up the raw data. Since there is a null count in the metadata, any value in an array may be null rather than holding a value, regardless of the type. Apart from the union data type, all arrays have a validity bitmap as one of their buffers, which can optionally be left out if there are no nulls in the array. As you might expect, a 1 in the corresponding bit means there is a valid value in that slot, and 0 means it's null.
Quick summary of physical layouts, or TL;DR
When working with Arrow formatted data, it's important to understand how it is physically laid out in memory. Understanding the physical layouts can provide ideas for efficiently constructing (or deconstructing) Arrow data when developing applications. Here's a quick summary:
Figure 1.6 – Table of physical layouts
The following is a walkthrough of the physical memory layouts that are used by the Arrow format. This is primarily useful for either implementing the Arrow specification yourself (or contributing to the libraries) or if you simply want to know what's going on under the hood and how it all works.
Primitive fixed-length value arrays
Let's look at an example of a 32-bit integer array that looks like this: [1, null, 2, 4, 8]. What would the physical layout look like in memory based on the information so far (Figure 1.7)? Something to keep in mind is that all of the buffers should be padded to a multiple of 64 bytes for alignment, which matches the largest SIMD instructions available on widely deployed x86 architecture processors (Intel AVX-512), and that the values for null slots are marked UNF or undefined. Implementations are free to zero out the data in null slots if they desire, and many do. But, since the format specification does not define anything, the data in a null slot could technically be anything.
Figure 1.7 – Layout of primitive int32 array
This same conceptual layout is the case for any fixed-size primitive type, with the only exception being that the validity buffer can be left out entirely if there are no nulls in the array. For any data type that is physically represented as simple fixed-bit-width values, such as integers, floating-point values, fixed-size binary arrays, or even timestamps, it will use this layout in memory. The padding for the buffers in the subsequent diagrams will be left out just to avoid cluttering them.
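If you want to see this layout for yourself, the pyarrow library (set up later in this chapter) can expose the underlying buffers. A minimal sketch, assuming pyarrow is installed:

import pyarrow as pa

arr = pa.array([1, None, 2, 4, 8], type=pa.int32())

validity, data = arr.buffers()  # primitive layout: validity bitmap + values
print(arr.null_count)           # 1
# One bit per slot, least-significant bit first: 0b11101, where the
# single 0 bit marks the null in slot 1.
print(bin(validity.to_pybytes()[0]))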
Variable-length binary arrays
Things get slightly trickier when dealing with variable-length value arrays, generally used for variable size binary or string data. In this layout, every value can consist of 0 or more bytes and, in addition to the data buffer, there will also be an offsets buffer. Using an offsets buffer allows the entirety of the data of the array to be held in a single contiguous memory buffer. The only lookup cost for finding the value of a given index is to look up the indexes in the offsets buffer to find the correct slice of the data. The offsets buffer will always contain length + 1 signed integers (either 32-bit or 64-bit, based on the logical type being used) that indicate the starting position of each corresponding slot of the array. Consider the array of two strings: [ "Water", "Rising" ].
Figure 1.8 – Arrow string versus traditional string vector
This differs from a lot of standard ways of representing a list of strings in memory in most library models. Generally, a string is represented as a pointer to a memory location and an integer for the length, so a vector of strings is really a vector of these pointers and lengths (Figure 1.8). For many use cases, this is very efficient since, typically, a single memory address is going to be much smaller than the size of the string data, so passing around this address and length is efficient for referencing individual strings.
Figure 1.9 – Viewing string index 1
If your goal is operating on a large number of strings though, it's much more efficient to have a single buffer to scan through in memory. As you operate on each string, you can maintain the memory locality we mentioned before, keeping the memory we need to look at physically close to the next chunk of memory we're likely going to need. This way, we spend less time jumping around different pages of memory and can spend more CPU cycles performing the computations. It's also extremely efficient to get a single string, as you can simply take a view of the buffer by using the address indicated by the offset to create a string object without copying the data.
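Using pyarrow again as a lens, here's a short sketch showing the offsets and data buffers for the ["Water", "Rising"] example:

import struct
import pyarrow as pa

arr = pa.array(["Water", "Rising"])

validity, offsets_buf, data_buf = arr.buffers()
print(validity)  # None: the bitmap is omitted since there are no nulls

# The offsets buffer holds length + 1 int32 values: (0, 5, 11).
print(struct.unpack_from("3i", offsets_buf))
print(data_buf.to_pybytes()[:11])  # b'WaterRising': one contiguous buffer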
List and fixed-size list arrays
What about nested formats? Well, they work in a similar way to the variable-length binary format. First up is the variable-size list layout. It's defined by two buffers, a validity bitmap and an offsets buffer, along with a child array. The difference between this and the variable-length binary format is that instead of the offsets referencing a buffer, they are instead indexes into the child array (which could itself potentially be a nested type). The common denotation of list types is to specify them as List<T>, where T is any type at all. When using 64-bit offsets instead of 32-bit ones, it is denoted as LargeList<T>. Let's represent the following List<int32> array: [[12, -7, 25], null, [0, -127, 127, 50], []].
Figure 1.10 – Layout of list array
The first thing to notice in the preceding diagram is that the offsets buffer has exactly one more element than the List array it belongs to: there are four elements in our List<int32> array and five elements in the offsets buffer. Each value in the offsets buffer represents the starting slot of the corresponding list index i. Looking closer at the offsets buffer, we notice that 3 and 7 are repeated, indicating that those lists are either null or empty (have a length of 0). To discover the length of a list at a given slot, you simply take the difference between the offset for that slot and the offset after it: length[i] = offsets[i + 1] - offsets[i]. The same holds true for the previous variable-length binary format; the number of bytes for a given slot is the difference in the offsets. Knowing this, what is the length of the list at index 2 of Figure 1.10? It's offsets[3] - offsets[2] = 7 - 3 = 4 (remember, 0-based indexes!). With this, we can tell that the list at index 3 is empty because the bitmap has a 1, but the length is 0 (7 - 7). This also explains why we need that extra element in the offsets buffer! We need it to be able to calculate the length of the last element in the array.
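You can verify the offset arithmetic with pyarrow; a small sketch of the same List<int32> array:

import pyarrow as pa

arr = pa.array([[12, -7, 25], None, [0, -127, 127, 50], []],
               type=pa.list_(pa.int32()))

print(arr.offsets.to_pylist())  # [0, 3, 3, 7, 7]
print(arr.values)               # the flattened child int32 array

offsets = arr.offsets.to_pylist()
print(offsets[3] - offsets[2])  # 4: the length of the list at index 2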
Given that example, what would a List<List<int32>> array look like? I'll leave that as an exercise for you to figure out.
There's also a FixedSizeList<T>[N] type, which works nearly the same as the variable-sized list, except there's no need for an offsets buffer. The child array of a fixed-size list type is the values array, complete with its own validity buffer. The value in slot j of a fixed-size list array is stored in an N-long slice of the values array, starting at offset j * N. Figure 1.11 shows what this looks like:
Figure 1.11 – Layout of fixed-size list array
What's the benefit of FixedSizeList versus List? Look back at the two diagrams again! Determining the values for a given slot of FixedSizeList doesn't require any lookups into a separate offsets buffer, making it more efficient if you know that your lists will always be a specific size. As a result, you also save space by not needing the extra memory for an offsets buffer at all!
One thing to keep in mind is the semantic difference between a null value and an empty list. Using JSON notation, the difference is equivalent to the difference between null and []. The meaning of such a difference would be up to a particular application to decide, but it's important to note that a null list is not identical to an empty list, even though the only difference in the physical representation is the bit in the validity bitmap.
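A tiny pyarrow sketch of that distinction:

import pyarrow as pa

arr = pa.array([None, []], type=pa.list_(pa.int32()))
print(arr.is_null())    # [true, false]: only the first slot is null
print(arr.to_pylist())  # [None, []]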
Phew! That was a lot. We're almost done!
The next type on our tour of the Arrow format is the struct type's layout. A struct is a nested type that has an ordered sequence of fields that can all have their own distinct types. It's semantically very similar to a simple object with attributes that you might find in a variety of programming languages. Each field must have its own UTF-8 encoded name, and these field names are part of the metadata for defining a struct type. Instead of having any physical storage allocated for its values, a struct array has one child array for each of its fields. All of these child arrays are independent and don't need to be adjacent to each other in memory; remember, our goal is column- (or field-) oriented, not row-oriented. A struct array must have a validity bitmap if it contains one or more null struct values. It can still contain a validity bitmap if there are no null values; it's just optional in that case.
Let's use the example of a struct with the following structure: Struct<name: VarBinary, age: Int32>. An array of this type would have two child arrays, one VarBinary array (a variable-sized binary layout) and one 4-byte primitive value array having a logical type of Int32. With this definition, we can map out a representation of the array: [{'joe', 1}, {null, 2}, null, {'mark', 4}].
Figure 1.12 – Layout of struct array
When an entire slot of the struct array is set to null, the null is represented in the parent's validity bitmap, which is different from a particular value in a child array being null. In Figure 1.12, the child arrays each have a slot for the null struct in which they could have any value at all, but would be hidden by the struct array's validity bitmap marking the corresponding struct slot as null and taking priority over the children.
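Here's a short pyarrow sketch of the same Struct<name: VarBinary, age: Int32> array, showing the parent's validity bitmap and the independent child arrays:

import pyarrow as pa

struct_type = pa.struct([("name", pa.binary()), ("age", pa.int32())])
arr = pa.array(
    [{"name": b"joe", "age": 1}, {"name": None, "age": 2},
     None, {"name": b"mark", "age": 4}],
    type=struct_type,
)

print(arr.is_null())      # only the third struct slot is null overall
print(arr.field("name"))  # each field is its own independent child array
print(arr.field("age"))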
Union arrays – sparse and dense
For the case when a single column could have multiple types, there exists the Union type array. Whereas the struct array is an ordered sequence of fields, a union type is an ordered sequence of types. The value in each slot of the array could be of any of these types, which are named like struct fields and included in the metadata of the type. Unlike other layouts, the union type does not have its own validity bitmap. Instead, each slot's validity is determined by the children, which are composed to create the union array itself. There are two distinct union layouts that can be used when creating an array: dense and sparse, each optimized for a different use case.
A dense union represents a mixed-type array with 5 bytes of overhead for each value, and contains the following structures:
One child array for each type
A types buffer: A buffer of 8-bit signed integers, with each value representing the type ID for the corresponding slot, indicating which child vector to read from for that slot
An offsets buffer: A buffer of signed 32-bit integers, indicating the offset into the corresponding child's array for the type in each slot
The dense union allows for the common use case of a union of structs with non-overlapping fields: Union<s1: Struct1, s2: Struct2, ..., sn: StructN>. Here's an example of the layout for a union of type Union<f: float32, i: int32> with the values [{f=1.2}, null, {f=3.4}, {i=5}]:
Figure 1.13 – Layout of dense union array
A sparse union has the same structure as the dense, except without an offsets array, as each child array is equal in length to the union itself. Figure 1.14 shows the same union array from Figure 1.13 as a sparse union array. There's no offsets buffer; both children are the same length of 4 as opposed to being different lengths:
Figure 1.14 – Layout of sparse union array
Even though a sparse union takes up significantly more space compared to a dense union, it has some advantages for specific use cases. In particular, a sparse union is much more easily used with vectorized expression evaluation in many cases, and a group of equal length arrays can be interpreted as a union by only having to define the types buffer. When interpreting a sparse union, only the slot in a child indicated by the types array is considered; the rest of the unselected values are ignored and could be anything.
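Building a union by hand shows the pieces involved. Here's a sketch using pyarrow's UnionArray to construct the dense union from Figure 1.13 (float32 values may print with extra precision):

import pyarrow as pa

# Union<f: float32, i: int32> with values [{f=1.2}, null, {f=3.4}, {i=5}].
types = pa.array([0, 0, 0, 1], type=pa.int8())
offsets = pa.array([0, 1, 2, 0], type=pa.int32())
f_child = pa.array([1.2, None, 3.4], type=pa.float32())
i_child = pa.array([5], type=pa.int32())

dense = pa.UnionArray.from_dense(types, offsets, [f_child, i_child],
                                 field_names=["f", "i"])
print(dense.to_pylist())  # [1.2, None, 3.4, 5]

Note that the null slot comes from the null in the f child array, since a union has no validity bitmap of its own.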
Next, we arrive at the layout for dictionary-encoded arrays. If you have data that has many repeated values, then significant space can potentially be saved by using dictionary encoding to represent the data values as integers referencing indexes into a dictionary that usually consists of unique values. Since a dictionary is an optional property on any array, any array can be dictionary-encoded. The layout of a dictionary-encoded array is that of a primitive integer array of non-negative integers, each of which represents an index into the dictionary. The dictionary itself is a separate array with its own respective layout of the appropriate type.
For example, let's say you have the following array: ["foo", "bar", "foo", "bar", null, "baz"]. Without dictionary encoding, we'd have an array that looks like this:
Figure 1.15 – String array without dictionary encoding
If we add dictionary encoding, we just need to get the unique values and create an array of indexes that references a dictionary array. The common case is to use int32, but any integral type would work:
Figure 1.16 – Dictionary-encoded string array
For this trivial example, it's not particularly enticing, but it's very clear how, in the case of an array with a lot of repeated values, this could be a significant memory usage improvement. You can even perform operations directly on a dictionary array, updating the dictionary if needed or even swapping out the dictionary and replacing it.
As written in the specification, a dictionary is allowed to contain duplicates and even null values. However, the null count of a dictionary-encoded array is dictated by the validity bitmap of the indices, regardless of any nulls that might be in the dictionary itself.
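The pyarrow library can perform this encoding directly; a brief sketch using the array from Figure 1.15:

import pyarrow as pa

arr = pa.array(["foo", "bar", "foo", "bar", None, "baz"])
dict_arr = arr.dictionary_encode()

print(dict_arr.indices)     # [0, 1, 0, 1, null, 2]
print(dict_arr.dictionary)  # ["foo", "bar", "baz"]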
Finally, there is only one more layout, but it's simple: a null array. A null array is an optimized layout for an array of all null values, with the type set to null; the only thing it contains is a length, no validity bitmap, and no data buffer.
How to speak Arrow
We've mentioned a few of the logical types already when describing the physical layouts, but let's get a full description of the currently available logical types, as of Arrow release version 7.0.0, for your reading pleasure. In general, the logical types are what is referred to as the data type of an array in the libraries, rather than the physical layouts. These types are what you will generally see when working with Arrow arrays in code:
Null logical type: Null physical type
Boolean: Primitive array with data represented as a bitmap
Primitive integer types: Primitive, fixed-size array layout:
Int8, Uint8, Int16, Uint16, Int32, Uint32, Int64, and Uint64
Floating-point types: Primitive fixed-size array layout:
Float16, Float32 (float), and Float64 (double)
VarBinary types: Variable length binary physical layout:
Binary and String (UTF-8)
LargeBinary and LargeString (variable length binary with 64-bit offsets)
Decimal128 and Decimal256: 128-bit and 256-bit fixed-size primitive arrays with metadata to specify the precision and scale of the values
Fixed-size binary: Fixed-size binary physical layout
Temporal types: Primitive fixed-size array physical layout
Date types: Dates with no time information:
Date32: 32-bit integers representing the number of days since the Unix epoch (1970-01-01)
Date64: 64-bit integers representing milliseconds since the Unix epoch (1970-01-01)
Time types: Time information with no date attached:
Time32: 32-bit integers representing elapsed time since midnight as seconds or milliseconds, with the unit specified by metadata.
Time64: 64-bit integers representing elapsed time since midnight as microseconds or nanoseconds, with the unit specified by metadata.
Timestamp: 64-bit integer representing the time since the Unix epoch, not including leap seconds. Metadata defines the unit (seconds, milliseconds, microseconds, or nanoseconds) and, optionally, a time zone as a string.
Interval types: An absolute length of time in terms of calendar artifacts:
YearMonth: Number of elapsed whole months as a 32-bit signed integer.
DayTime: Number of elapsed days and milliseconds as two consecutive 4-byte signed integers (8 bytes total per value).
MonthDayNano: Elapsed months, days, and nanoseconds stored as contiguous 16-byte blocks. Months and days as two 32-bit integers and nanoseconds since midnight as a 64-bit integer.
Duration: An absolute length of time not related to calendars as a 64-bit integer and a unit specified by metadata indicating seconds, milliseconds, microseconds, or nanoseconds.
List and FixedSizeList: Their respective physical layouts:
LargeList: A list type with 64-bit offsets
Struct, DenseUnion, and SparseUnion types: Their respective physical layouts
Map: A logical type that is physically represented as List<Struct<K, V>>, where K and V are the respective types of the keys and values in the map:
Metadata is included indicating whether or not the keys are sorted.
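In code, these logical types are created from factory functions. A quick pyarrow sketch constructing a few of the types listed above:

import pyarrow as pa

print(pa.timestamp("ms", tz="UTC"))      # timestamp[ms, tz=UTC]
print(pa.decimal128(10, 2))              # decimal128(10, 2)
print(pa.time32("s"))                    # time32[s]
print(pa.duration("ns"))                 # duration[ns]
print(pa.list_(pa.int64()))              # list<item: int64>
print(pa.map_(pa.string(), pa.int32()))  # map<string, int32>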
Whenever speaking about the types of an array from an application or semantic standpoint, we will always be using the types indicated in the preceding list to describe them. As you can see, the logical types make it very easy to represent both flat and hierarchical types of data. Now that we've covered the physical memory layouts, let's have a quick word about the versioning and stability of the Arrow format and libraries.
Arrow format versioning and stability
In order to ensure that updating the version of the Arrow library in use won't break applications, and to guarantee the long-term stability of the Arrow project, there are two versions used to describe each release of the project: the format version and the library version. Different library implementations and releases can have different versions, but will always be implementing a specific format version. From version 1.0.0 onward, semantic versioning is used for releases.
Provided the major version of the format is the same between two libraries, any new library is backward-compatible with any older library with regards to being able to read data and metadata produced by an older library. Increases in the minor version of the format, such as an increase from version 1.0.0 to version 1.1.0, indicate new features that were added. As long as these new features are not used (such as new logical types or physical layouts), older libraries will be able to read data and metadata produced by newer versions of the libraries.
As far as the long-term stability of the format and libraries is concerned, only an increase in the major version of the format would break the previous guarantees about compatibility. The Arrow project does not expect this to be a frequent occurrence; rather, it would be an exceptional event, and such a release would be deployed with caution. As a result of these compatibility guarantees, it ends up being safe and simple to ensure backward and forward compatibility when using the Arrow libraries and format.
Would you download a library? Of course!
The Arrow project distributes native library implementations for several languages, including C++, C#, Go, Java, JavaScript, Julia, and Rust. On top of those, there are libraries for C (GLib), MATLAB, Python, R, and Ruby, which are all built on top of the C++ library, which happens to have the most active development. As you might expect, the various implementations are at different stages as far as which features and aspects of the specification are implemented, and the documentation helpfully provides an implementation matrix showing which features are implemented in which libraries. The implementation matrix is updated as these aspects of the specification and features are implemented in a given library.
With so many different implementations, you might be concerned about interoperability between them. To ensure this interoperability, the various library versions are integration tested via automated continuous integration (CI) jobs. Depending on the language, these libraries are distributed and tested on a very large variety of platforms, including but not limited to the following:
R packages on CRAN (https://cran.r-project.org/)
Julia packages in the general registry (https://github.com/JuliaRegistries/General)
Ruby packages with RubyGems (https://rubygems.org/)
C# packages with NuGet (https://www.nuget.org/packages/Apache.Arrow/)
APT and Yum repositories for various Debian, Ubuntu, Red Hat, and CentOS distributions
Java Artifacts on Maven Central
Pip wheels for Python
When developing something that will utilize the Arrow libraries, keep the terms that were mentioned a few pages ago in mind, as most of the libraries utilize similar terminology and naming for describing their Application Programming Interfaces (APIs).
Setting up your shooting range
By now, you should have a pretty solid understanding of what Arrow is, the basics of how it's laid out in memory, and the basic terminology. So now, let's set up a development environment where you can test out and play with Arrow. For the purposes of this book, I'm going to primarily focus on the three libraries that I'm most familiar with: the C++ library, the Python library, and the Go library. While the basic concepts will apply to all of the implementations, the precise APIs may differ between them so, armed with the knowledge gained so far, you should be able to make sense of the documentation for your preferred language, even without precise examples for that language being printed here.
For each of C++, Python, and Go, after the instructions for installing the Arrow library, I'll go through a few exercises to get you acquainted with the basics of using the Arrow library in that language.
Using pyarrow for Python
With data science being a primary target of Arrow, it's no surprise that the Python library tends to be the most commonly used and interacted with by developers. Let's start with a quick introduction to setting up and using the pyarrow library for development.
Most modern IDEs provide plugins with exceptional Python support so you can fire up your preferred Python development IDE. I highly recommend using one of the methods for creating virtual environments with Python, such as pipenv, venv, or virtualenv, for setting up your environment. After creating that virtual environment, in most cases, installing pyarrow is as simple as using pip to install it:
$ pipenv install pyarrow

Or, using venv:

$ python3 -m venv arrow_playground
$ source arrow_playground/bin/activate
$ pip3 install pyarrow
It's also possible that, depending on your settings and platform, pip may attempt to build pyarrow locally. You can use the --prefer-binary or --only-binary arguments to tell pip to install the pre-built binary package rather than building from source:
$ pip3 install pyarrow --only-binary pyarrow
As an alternative to pip, Conda is a common toolset utilized by data scientists and engineers, and the Arrow project provides binary Conda packages on conda-forge for Linux, macOS, and Windows for Python 3.6+. You can install pyarrow with Conda and conda-forge as follows:
$ conda install pyarrow=6.0.* -c conda-forge
Understanding the basics of pyarrow
With the package installed, let's first confirm that it was installed successfully by opening up the Python interpreter and trying to import it:
>>> import pyarrow as pa
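If the import succeeds, a quick smoke test confirms that you can build Arrow data. Here's a minimal sketch run in the interpreter (the version string shown is just an example; yours may differ):

>>> pa.__version__
'7.0.0'
>>> arr = pa.array([1, None, 2, 4, 8], type=pa.int32())
>>> arr.null_count
1
>>> batch = pa.RecordBatch.from_pydict({"nums": arr})
>>> batch.num_rows
5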
You should probably be pretty good at this by now if you did this exercise in the Python and C++ libraries already!
The goal of this chapter was to explain what Apache Arrow is, get you acquainted with the format, and have you use it in some simple use cases. This knowledge forms the baseline of everything else for us to talk about in the rest of the book!
Just as a reminder, you can check the GitHub repository (https://github.com/PacktPublishing/In-Memory-Analytics-with-Apache-Arrow-) for the solutions to the exercises presented here and for the full code samples to make sure you understand the concepts!
The previous examples and exercises are all fairly trivial and are meant to help reinforce the concepts introduced about the format and the specification while helping you get familiar with using Arrow in code.
In Chapter 2, Working with Key Arrow Specifications, we will introduce how to read your data into the Arrow format, whether it's on your local disk, Hadoop Distributed File System (HDFS), S3, or elsewhere, and integrate Arrow into some of the various processes and utilities you might already use with your data, such as the pandas integration. We will also discover how to pass your data around between services and processes while keeping it in the Arrow format for performance.
Ready? Onward and upward!
Here's a list of the URL references we made in this chapter since there were quite a lot!
Apache Arrow documentation: https://arrow.apache.org/docs/