Google Cloud Storage (GCS)
Overview
This destination writes data to a GCS bucket.
The Airbyte GCS destination allows you to sync data to cloud storage buckets. Each stream is written to its own directory under the bucket.
Sync overview
Features
Feature | Support | Notes |
---|---|---|
Full Refresh Sync | ✅ | Warning: this mode deletes all previously synced data in the configured bucket path. |
Incremental - Append Sync | ✅ | Warning: Airbyte provides at-least-once delivery. Depending on your source, you may see duplicated data. Learn more here |
Incremental - Append + Deduped | ❌ | |
Namespaces | ❌ | Setting a specific bucket path is equivalent to having separate namespaces. |
Getting started
Requirements
- Allow connections from the Airbyte server to your GCS bucket (if they exist in separate VPCs).
- A GCS bucket with credentials (for the COPY strategy).
Setup guide
- Fill in the GCS info
- GCS Bucket Name
- See this for instructions on how to create a GCS bucket. The bucket cannot have a retention policy. Set Protection Tools to none or Object versioning.
- GCS Bucket Region
- HMAC Key Access ID
- See this on how to generate an access key. For more information on HMAC keys, please refer to the GCP docs.
- We recommend creating an Airbyte-specific user or service account. This user or account will require the following permissions for the bucket:
storage.multipartUploads.abort
storage.multipartUploads.create
storage.objects.create
storage.objects.delete
storage.objects.get
storage.objects.list
You can set these by going to the Permissions tab of the GCS bucket, adding the email address of the service account or user, and granting the permissions listed above.
- Secret Access Key
- Corresponding key to the above access ID.
- Make sure your GCS bucket is accessible from the machine running Airbyte. This depends on your networking setup. The easiest way to verify if Airbyte is able to connect to your GCS bucket is via the check connection tool in the UI.
Configuration
Parameter | Type | Notes |
---|---|---|
GCS Bucket Name | string | Name of the bucket to sync data into. |
GCS Bucket Path | string | Subdirectory under the above bucket to sync the data into. |
GCS Region | string | See here for all region codes. |
HMAC Key Access ID | string | The HMAC key access ID for the GCS bucket. When linked to a service account, this ID is 61 characters long; when linked to a user account, it is 24 characters long. See HMAC key for details. |
HMAC Key Secret | string | The corresponding secret for the access ID. It is a 40-character base-64 encoded string. |
Format | object | Format specific configuration. See below for details. |
Part Size | integer | Size of each uploaded block. GCS allows at most 10,000 blocks per upload, so the maximum stream size is blockSize * 10,000. |
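The Part Size arithmetic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table does not give the unit of the part size, so megabytes are assumed here, and the function names are hypothetical rather than part of the connector.

```python
MAX_PARTS = 10_000  # maximum number of blocks per upload, as stated in the table


def max_stream_size_mb(part_size_mb: int) -> int:
    """Largest stream size (assumed unit: MB) that fits in MAX_PARTS blocks."""
    if part_size_mb <= 0:
        raise ValueError("part size must be positive")
    return part_size_mb * MAX_PARTS


def min_part_size_mb(stream_size_mb: int) -> int:
    """Smallest whole-MB part size that can hold a stream of the given size."""
    if stream_size_mb <= 0:
        raise ValueError("stream size must be positive")
    # Ceiling division without floats: -(-a // b)
    return max(1, -(-stream_size_mb // MAX_PARTS))
```

Under these assumptions, a part size of 5 allows streams of up to 50,000 MB, and a stream slightly larger than that needs a part size of at least 6.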
Currently, only the HMAC key credential is supported. More credential types will be added in the future; please submit an issue with your request.
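Before running the check connection tool, it can help to sanity-check the HMAC credentials against the lengths given in the table above. The following is a minimal sketch; the function name is hypothetical, and the check only looks at length and character set, it does not contact GCS.

```python
import re
from typing import List

# Lengths taken from the configuration table above.
SERVICE_ACCOUNT_ID_LENGTH = 61
USER_ACCOUNT_ID_LENGTH = 24
SECRET_LENGTH = 40

# Characters of the standard base-64 alphabet (assumption: no URL-safe variant).
_BASE64_CHARS = re.compile(r"[A-Za-z0-9+/=]+")


def check_hmac_key(access_id: str, secret: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means the shapes look right."""
    problems = []
    if len(access_id) not in (SERVICE_ACCOUNT_ID_LENGTH, USER_ACCOUNT_ID_LENGTH):
        problems.append(
            f"access ID has {len(access_id)} characters, expected "
            f"{SERVICE_ACCOUNT_ID_LENGTH} (service account) or {USER_ACCOUNT_ID_LENGTH} (user account)"
        )
    if len(secret) != SECRET_LENGTH:
        problems.append(f"secret has {len(secret)} characters, expected {SECRET_LENGTH}")
    elif not _BASE64_CHARS.fullmatch(secret):
        problems.append("secret contains characters outside the base-64 alphabet")
    return problems
```

A credential pair that passes this check can still be rejected by GCS, for example if the key is inactive or lacks the required permissions.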
Additionally, your bucket must be encrypted using a Google-managed encryption key (this is the default setting when creating a new bucket). We currently do not support buckets using customer-managed encryption keys (CMEK). You can view this setting under the "Configuration" tab of your GCS bucket, in the Encryption type row.
⚠️ Please note that under "Full Refresh Sync" mode, data in the configured bucket and path will be wiped out before each sync. We recommend provisioning a dedicated GCS resource for this sync to prevent unexpected data deletion from misconfiguration. ⚠️
The full path of the output data is:
<bucket-name>/<bucket-path>/<source-namespace-if-exists>/<stream-name>/<upload-date>_<upload-millis>_<partition-id>.<format-extension>
For example:
testing_bucket/data_output_path/public/users/2021_01_01_1609541171643_0.csv.gz
↑              ↑                ↑      ↑     ↑          ↑             ↑ ↑
|              |                |      |     |          |             | format extension
|              |                |      |     |          |             partition id
|              |                |      |     |          upload time in millis
|              |                |      |     upload date in YYYY_MM_DD
|              |                |      stream name
|              |                source namespace (if it exists)
|              bucket path
bucket name
Please note that the stream name may contain a prefix, if it is configured on the connection.
The rationale behind this naming pattern is:
- Each stream has its own directory.
- The data output files can be sorted by upload time.
- The upload time is composed of a date part and a millis part, so that it is both readable and unique.
A data sync may create multiple files, as the output files can be partitioned by size (targeting a size of 200 MB compressed or lower).
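The naming pattern can be reproduced with a short helper. This is a sketch, not the connector's implementation: the function name is hypothetical, the underscore separators follow the example path above, and deriving the date part from the millisecond timestamp in UTC is an assumption.

```python
from datetime import datetime, timezone
from typing import Optional


def build_output_path(
    bucket: str,
    bucket_path: Optional[str],
    namespace: Optional[str],
    stream: str,
    upload_millis: int,
    partition_id: int,
    extension: str,
) -> str:
    """Assemble an output path following the pattern shown in the example above."""
    # Assumption: the date part is the UTC date of the upload timestamp.
    upload_date = datetime.fromtimestamp(upload_millis / 1000, tz=timezone.utc).strftime("%Y_%m_%d")
    file_name = f"{upload_date}_{upload_millis}_{partition_id}.{extension}"
    # Optional components (bucket path, namespace) are skipped when empty or None.
    parts = [bucket, bucket_path, namespace, stream, file_name]
    return "/".join(part for part in parts if part)


path = build_output_path(
    "testing_bucket", "data_output_path", "public", "users", 1609541171643, 0, "csv.gz"
)
print(path)
```

Because the date and millisecond timestamp lead the file name, sorting the file names within one stream directory lexicographically also sorts them by upload time.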
Output Schema
Each stream is written to its own dedicated directory according to the configuration. The complete datastore of each stream includes all the output files under that directory. You can think of the directory as the equivalent of a table in the database world.
- Under Full Refresh Sync mode, old output files will be purged before new files are created.
- Under Incremental - Append Sync mode, new output files will be added that only contain the new data.
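The effect of the two sync modes on a stream directory can be modeled as a small function over file names. This is a simplified illustration only: the mode labels and the function name are hypothetical, and the model ignores the actual upload mechanics.

```python
from typing import Iterable, List


def files_after_sync(existing: Iterable[str], new: Iterable[str], mode: str) -> List[str]:
    """Return the file names present in a stream directory after one sync."""
    if mode == "full_refresh":
        # Old output files are purged before the new files are created.
        return sorted(new)
    if mode == "incremental_append":
        # New files, holding only the new data, are added next to the old ones.
        return sorted(set(existing) | set(new))
    raise ValueError(f"unsupported sync mode: {mode}")


old = ["2021_01_01_1609541171643_0.csv.gz"]
fresh = ["2021_01_02_1609627571643_0.csv.gz"]
print(files_after_sync(old, fresh, "full_refresh"))
print(files_after_sync(old, fresh, "incremental_append"))
```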
Avro
Apache Avro serializes data in a compact binary format. Currently, the Airbyte GCS Avro connector always uses the binary encoding, and assumes that all data records follow the same schema.
Configuration
Here are the available compression codecs:
- No compression
- deflate
  - Compression level
    - Range [0, 9]. Default to 0.
    - Level 0: no compression & fastest.
    - Level 9: best compression & slowest.
- bzip2
- xz
  - Compression level
    - Range [0, 9]. Default to 6.
    - Level 0-3 are fast with medium compression.
    - Level 4-6 are fairly slow with high compression.
    - Level 7-9 are like level 6 but use bigger dictionaries and have higher memory requirements. Unless the uncompressed size of the file exceeds 8 MiB, 16 MiB, or 32 MiB, it is a waste of memory to use the presets 7, 8, or 9, respectively.
- zstandard
  - Compression level
    - Range [-5, 22]. Default to 3.
    - Negative levels are 'fast' modes akin to lz4 or snappy.
    - Levels above 9 are generally for archival purposes.
    - Levels above 18 use a lot of memory.
  - Include checksum
    - If set to true, a checksum will be included in each data block.
- snappy
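The level ranges and defaults listed above can be captured in a small lookup, which is handy for validating a configuration before a sync. This is a minimal sketch; the codec keys and the function name are illustrative and may not match the identifiers the connector actually accepts.

```python
from typing import Optional

# (lowest level, highest level, default level), taken from the list above.
CODEC_LEVELS = {
    "deflate": (0, 9, 0),
    "xz": (0, 9, 6),
    "zstandard": (-5, 22, 3),
}

# Codecs listed above without a compression level setting.
CODECS_WITHOUT_LEVEL = {"no compression", "bzip2", "snappy"}


def resolve_compression_level(codec: str, level: Optional[int] = None) -> Optional[int]:
    """Return the level to use for a codec, or None if the codec takes no level."""
    if codec in CODECS_WITHOUT_LEVEL:
        if level is not None:
            raise ValueError(f"{codec} does not take a compression level")
        return None
    if codec not in CODEC_LEVELS:
        raise ValueError(f"unknown codec: {codec}")
    low, high, default = CODEC_LEVELS[codec]
    if level is None:
        return default
    if not low <= level <= high:
        raise ValueError(f"{codec} level must be in [{low}, {high}], got {level}")
    return level
```

For example, calling the function with "xz" and no level falls back to the documented default of 6, while a deflate level of 10 is rejected.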
Data schema
Under the hood, an Airbyte data stream described by a JSON schema is first converted to an Avro schema, then each JSON object is converted to an Avro record. Because the data stream can come from any data source, the JSON to Avro conversion process has arbitrary rules and limitations. Learn more about how source data is converted to Avro and the current limitations here.