COPY INTO — Bulk Loading

COPY INTO loads documents from external files into an Elasticsearch index using the Bulk API.

Syntax

COPY INTO table_name
FROM 'path/to/file'
[FILE_FORMAT = 'JSON' | 'JSON_ARRAY' | 'PARQUET' | 'DELTA_LAKE']
[ON CONFLICT (pk_column) DO UPDATE];

Behavior

COPY INTO performs:

  1. Index name validation
  2. Loading of the real Elasticsearch schema (mapping, primary key, partitioning)
  3. Primary key extraction (_id generation, composite PK concatenation)
  4. Partitioning extraction (suffix index name based on date)
  5. Bulk ingestion via the Bulk API
  6. Pipeline execution
  7. Result reporting (a DmlResult with inserted and rejected counts)
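Steps 3 and 4 above can be sketched as plain functions. This is an illustrative sketch only: the function names, the composite-PK separator, and the monthly date-suffix format are assumptions, not the actual implementation.

```python
from datetime import date

def build_doc_id(doc: dict, pk_columns: list) -> str:
    """Derive the Elasticsearch _id from the primary key.
    A composite PK is concatenated in column order (separator assumed)."""
    return "-".join(str(doc[col]) for col in pk_columns)

def partition_suffix(index: str, doc: dict, partition_column: str) -> str:
    """Suffix the index name with the document's partition date (format assumed)."""
    d = date.fromisoformat(doc[partition_column])
    return f"{index}-{d:%Y-%m}"

doc = {"uuid": "A12", "birthDate": "1967-11-21"}
build_doc_id(doc, ["uuid"])                   # "A12"
partition_suffix("people", doc, "birthDate")  # "people-1967-11"
```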

Supported File Formats

Format       Description
JSON         Newline-delimited JSON (one document per line)
JSON_ARRAY   JSON array of documents
PARQUET      Apache Parquet columnar format
DELTA_LAKE   Delta Lake table format
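The two JSON variants carry the same documents but frame them differently; a minimal parsing sketch in plain Python (not the loader itself) makes the distinction concrete:

```python
import json

ndjson = '{"uuid": "A12"}\n{"uuid": "A14"}\n'
json_array = '[{"uuid": "A12"}, {"uuid": "A14"}]'

# FILE_FORMAT = 'JSON': one complete document per line
docs_ndjson = [json.loads(line) for line in ndjson.splitlines() if line.strip()]

# FILE_FORMAT = 'JSON_ARRAY': the whole file is a single JSON array
docs_array = json.loads(json_array)

assert docs_ndjson == docs_array  # same documents, different framing
```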

Full Example

Table Definition

CREATE TABLE IF NOT EXISTS copy_into_test (
  uuid KEYWORD NOT NULL,
  name VARCHAR,
  birthDate DATE,
  childrenCount INT,
  PRIMARY KEY (uuid)
);

Data File (example_data.json)

{"uuid": "A12", "name": "Homer Simpson", "birthDate": "1967-11-21", "childrenCount": 0}
{"uuid": "A14", "name": "Moe Szyslak", "birthDate": "1967-11-21", "childrenCount": 0}
{"uuid": "A16", "name": "Barney Gumble", "birthDate": "1969-05-09", "childrenCount": 2}

COPY INTO Statement

COPY INTO copy_into_test
FROM 's3://my-bucket/path/to/example_data.json'
FILE_FORMAT = 'JSON'
ON CONFLICT (uuid) DO UPDATE;
  • PK = uuid → _id = uuid
  • ON CONFLICT DO UPDATE → Bulk upsert
  • Table pipeline is applied
  • Returns DmlResult(inserted = 3, rejected = 0)
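ON CONFLICT (uuid) DO UPDATE maps each document to a Bulk API update action with doc_as_upsert, so existing documents are updated and missing ones inserted. The sketch below builds such a request body by hand to show the wire format; the actual loader goes through a client library.

```python
import json

def bulk_upsert_body(index: str, docs: list, pk: str) -> str:
    """Build an NDJSON Bulk API body of doc_as_upsert update actions."""
    lines = []
    for doc in docs:
        # Action line: target index and the _id derived from the PK column
        lines.append(json.dumps({"update": {"_index": index, "_id": doc[pk]}}))
        # Source line: partial doc, inserted as-is when the _id does not exist yet
        lines.append(json.dumps({"doc": doc, "doc_as_upsert": True}))
    return "\n".join(lines) + "\n"  # the Bulk API requires a trailing newline

body = bulk_upsert_body(
    "copy_into_test",
    [{"uuid": "A12", "name": "Homer Simpson"}],
    "uuid",
)
```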

Remote File System Support

COPY INTO transparently supports remote file systems by auto-detecting the URI scheme. No SQL syntax change is required.

URI scheme                            File system                     Required JAR
s3a:// or s3://                       AWS S3                          hadoop-aws
abfs://, abfss://, wasb://, wasbs://  Azure ADLS Gen2 / Blob Storage  hadoop-azure
gs://                                 Google Cloud Storage            gcs-connector-hadoop3
hdfs://                               HDFS                            (bundled with hadoop-client)
(local path)                          Local filesystem                (no extra JAR needed)
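Scheme auto-detection amounts to a lookup on the URI scheme, falling back to the local filesystem when no scheme is present. The helper below is an illustration of that logic, mirroring the table above; it is not the client's actual code.

```python
from urllib.parse import urlparse

FILESYSTEMS = {
    "s3": "AWS S3", "s3a": "AWS S3",
    "abfs": "Azure ADLS Gen2 / Blob Storage",
    "abfss": "Azure ADLS Gen2 / Blob Storage",
    "wasb": "Azure ADLS Gen2 / Blob Storage",
    "wasbs": "Azure ADLS Gen2 / Blob Storage",
    "gs": "Google Cloud Storage",
    "hdfs": "HDFS",
}

def detect_filesystem(path: str) -> str:
    """Map the URI scheme to a file system; no scheme means a local path."""
    scheme = urlparse(path).scheme
    return FILESYSTEMS.get(scheme, "Local filesystem")

detect_filesystem("s3://my-bucket/data.json")  # "AWS S3"
detect_filesystem("/tmp/data.json")            # "Local filesystem"
```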

Credentials Configuration

Authentication is resolved automatically from standard environment variables.

AWS S3

AWS_ACCESS_KEY_ID # access key (falls back to DefaultAWSCredentialsProviderChain)
AWS_SECRET_ACCESS_KEY # secret key
AWS_SESSION_TOKEN # session token (optional)
AWS_REGION # region (or AWS_DEFAULT_REGION)
AWS_ENDPOINT_URL # custom endpoint for S3-compatible stores (MinIO, LocalStack, ...)
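The region fallback noted above can be pictured as a simple either/or lookup; this is a sketch of the documented precedence, not the client's resolution code:

```python
import os

def resolve_aws_region(env=os.environ):
    """AWS_REGION wins; AWS_DEFAULT_REGION is the documented fallback."""
    return env.get("AWS_REGION") or env.get("AWS_DEFAULT_REGION")

resolve_aws_region({"AWS_DEFAULT_REGION": "eu-west-1"})  # "eu-west-1"
```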

Azure ADLS Gen2 / Blob Storage

AZURE_STORAGE_ACCOUNT_NAME # storage account name
AZURE_STORAGE_ACCOUNT_KEY # shared key (Option 1)
AZURE_CLIENT_ID # service principal client ID (Option 2 — OAuth2)
AZURE_CLIENT_SECRET # service principal secret (Option 2 — OAuth2)
AZURE_TENANT_ID # Azure tenant ID (Option 2 — OAuth2)
AZURE_STORAGE_SAS_TOKEN # SAS token (Option 3)

Google Cloud Storage

GOOGLE_APPLICATION_CREDENTIALS # path to service-account JSON key file
GOOGLE_CLOUD_PROJECT # GCS project ID (optional)

Falls back to Application Default Credentials (Workload Identity, gcloud auth, …) when the variable is absent.

HDFS

HADOOP_CONF_DIR # directory containing core-site.xml and hdfs-site.xml
HADOOP_USER_NAME # Hadoop user name (optional)

Per-user Hadoop Overrides

Any *.xml file placed in ~/.softclient4es/ is loaded on top of the auto-detected configuration:

~/.softclient4es/s3a-override.xml

<configuration>
  <property>
    <name>fs.s3a.connection.maximum</name>
    <value>200</value>
  </property>
</configuration>
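The overlay semantics are: properties from an override file replace the corresponding keys of the auto-detected configuration. A pure-Python illustration of one file's overlay (the actual loader globs every *.xml in ~/.softclient4es/ and uses Hadoop's own Configuration class):

```python
import xml.etree.ElementTree as ET

def apply_override(conf: dict, xml_text: str) -> dict:
    """Overlay <property><name>/<value> pairs onto an existing configuration."""
    merged = dict(conf)
    for prop in ET.fromstring(xml_text).iter("property"):
        merged[prop.findtext("name")] = prop.findtext("value")
    return merged

override = """<configuration>
  <property>
    <name>fs.s3a.connection.maximum</name>
    <value>200</value>
  </property>
</configuration>"""

merged = apply_override({"fs.s3a.connection.maximum": "96"}, override)
# merged["fs.s3a.connection.maximum"] == "200"
```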

Version Compatibility

Feature      ES6  ES7  ES8  ES9
COPY INTO    Yes  Yes  Yes  Yes
JSON         Yes  Yes  Yes  Yes
JSON_ARRAY   Yes  Yes  Yes  Yes
PARQUET      Yes  Yes  Yes  Yes
DELTA_LAKE   Yes  Yes  Yes  Yes