COPY INTO — Bulk Loading

COPY INTO loads documents from external files into an Elasticsearch index using the Bulk API.

Syntax

COPY INTO table_name
FROM 'path/to/file'
[FILE_FORMAT = 'JSON' | 'JSON_ARRAY' | 'PARQUET' | 'DELTA_LAKE']
[ON CONFLICT (pk_column) DO UPDATE];

Behavior

COPY INTO performs:

  1. Index name validation
  2. Loading of the real Elasticsearch schema (mapping, primary key, partitioning)
  3. Primary key extraction (_id generation, composite PK concatenation)
  4. Partitioning extraction (suffix index name based on date)
  5. Bulk ingestion via the Bulk API
  6. Pipeline execution
  7. Result reporting (a DmlResult with inserted and rejected counts)
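Steps 3 and 4 above can be sketched as plain functions. This is an illustrative sketch only: the function names, the composite-PK separator, and the monthly date-suffix format are assumptions, not the actual implementation.

```python
from datetime import date

def build_doc_id(doc: dict, pk_columns: list) -> str:
    """Derive the Elasticsearch _id from the primary key.
    A composite PK is concatenated in column order (separator assumed)."""
    return "-".join(str(doc[col]) for col in pk_columns)

def partition_suffix(index: str, doc: dict, partition_column: str) -> str:
    """Suffix the index name with the document's partition date (format assumed)."""
    d = date.fromisoformat(doc[partition_column])
    return f"{index}-{d:%Y-%m}"

doc = {"uuid": "A12", "birthDate": "1967-11-21"}
build_doc_id(doc, ["uuid"])                   # "A12"
partition_suffix("people", doc, "birthDate")  # "people-1967-11"
```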

Supported File Formats

Format       Description
JSON         Newline-delimited JSON (one document per line)
JSON_ARRAY   JSON array of documents
PARQUET      Apache Parquet columnar format
DELTA_LAKE   Delta Lake table format
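The two JSON variants carry the same documents but frame them differently; a minimal parsing sketch in plain Python (not the loader itself) makes the distinction concrete:

```python
import json

ndjson = '{"uuid": "A12"}\n{"uuid": "A14"}\n'
json_array = '[{"uuid": "A12"}, {"uuid": "A14"}]'

# FILE_FORMAT = 'JSON': one complete document per line
docs_ndjson = [json.loads(line) for line in ndjson.splitlines() if line.strip()]

# FILE_FORMAT = 'JSON_ARRAY': the whole file is a single JSON array
docs_array = json.loads(json_array)

assert docs_ndjson == docs_array  # same documents, different framing
```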

Full Example

Table Definition

CREATE TABLE IF NOT EXISTS copy_into_test (
  uuid KEYWORD NOT NULL,
  name VARCHAR,
  birthDate DATE,
  childrenCount INT,
  PRIMARY KEY (uuid)
);

Data File (example_data.json)

{"uuid": "A12", "name": "Homer Simpson", "birthDate": "1967-11-21", "childrenCount": 0}
{"uuid": "A14", "name": "Moe Szyslak", "birthDate": "1967-11-21", "childrenCount": 0}
{"uuid": "A16", "name": "Barney Gumble", "birthDate": "1969-05-09", "childrenCount": 2}

COPY INTO Statement

COPY INTO copy_into_test
FROM 's3://my-bucket/path/to/example_data.json'
FILE_FORMAT = 'JSON'
ON CONFLICT (uuid) DO UPDATE;
  • PK = uuid → _id = uuid
  • ON CONFLICT DO UPDATE → Bulk upsert
  • Table pipeline is applied
  • Returns DmlResult(inserted = 3, rejected = 0)
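ON CONFLICT (uuid) DO UPDATE maps each document to a Bulk API update action with doc_as_upsert, so existing documents are updated and missing ones inserted. The sketch below builds such a request body by hand to show the wire format; the actual loader goes through a client library.

```python
import json

def bulk_upsert_body(index: str, docs: list, pk: str) -> str:
    """Build an NDJSON Bulk API body of doc_as_upsert update actions."""
    lines = []
    for doc in docs:
        # Action line: target index and the _id derived from the PK column
        lines.append(json.dumps({"update": {"_index": index, "_id": doc[pk]}}))
        # Source line: partial doc, inserted as-is when the _id does not exist yet
        lines.append(json.dumps({"doc": doc, "doc_as_upsert": True}))
    return "\n".join(lines) + "\n"  # the Bulk API requires a trailing newline

body = bulk_upsert_body(
    "copy_into_test",
    [{"uuid": "A12", "name": "Homer Simpson"}],
    "uuid",
)
```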

Remote File System Support

COPY INTO transparently supports remote file systems by auto-detecting the URI scheme. No SQL syntax change is required.

URI scheme                            File system                     Required JAR
s3a:// or s3://                       AWS S3                          hadoop-aws
abfs://, abfss://, wasb://, wasbs://  Azure ADLS Gen2 / Blob Storage  hadoop-azure
gs://                                 Google Cloud Storage            gcs-connector-hadoop3
hdfs://                               HDFS                            (bundled with hadoop-client)
(local path)                          Local filesystem                (no extra JAR needed)
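Scheme auto-detection amounts to a lookup on the URI scheme, falling back to the local filesystem when no scheme is present. The helper below is an illustration of that logic, mirroring the table above; it is not the client's actual code.

```python
from urllib.parse import urlparse

FILESYSTEMS = {
    "s3": "AWS S3", "s3a": "AWS S3",
    "abfs": "Azure ADLS Gen2 / Blob Storage",
    "abfss": "Azure ADLS Gen2 / Blob Storage",
    "wasb": "Azure ADLS Gen2 / Blob Storage",
    "wasbs": "Azure ADLS Gen2 / Blob Storage",
    "gs": "Google Cloud Storage",
    "hdfs": "HDFS",
}

def detect_filesystem(path: str) -> str:
    """Map the URI scheme to a file system; no scheme means a local path."""
    scheme = urlparse(path).scheme
    return FILESYSTEMS.get(scheme, "Local filesystem")

detect_filesystem("s3://my-bucket/data.json")  # "AWS S3"
detect_filesystem("/tmp/data.json")            # "Local filesystem"
```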

Credentials Configuration

Authentication is resolved automatically from standard environment variables.

AWS S3

AWS_ACCESS_KEY_ID # access key (falls back to DefaultAWSCredentialsProviderChain)
AWS_SECRET_ACCESS_KEY # secret key
AWS_SESSION_TOKEN # session token (optional)
AWS_REGION # region (or AWS_DEFAULT_REGION)
AWS_ENDPOINT_URL # custom endpoint for S3-compatible stores (MinIO, LocalStack, ...)
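The region fallback noted above can be pictured as a simple either/or lookup; this is a sketch of the documented precedence, not the client's resolution code:

```python
import os

def resolve_aws_region(env=os.environ):
    """AWS_REGION wins; AWS_DEFAULT_REGION is the documented fallback."""
    return env.get("AWS_REGION") or env.get("AWS_DEFAULT_REGION")

resolve_aws_region({"AWS_DEFAULT_REGION": "eu-west-1"})  # "eu-west-1"
```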

Azure ADLS Gen2 / Blob Storage

AZURE_STORAGE_ACCOUNT_NAME # storage account name
AZURE_STORAGE_ACCOUNT_KEY # shared key (Option 1)
AZURE_CLIENT_ID # service principal client ID (Option 2 — OAuth2)
AZURE_CLIENT_SECRET # service principal secret (Option 2 — OAuth2)
AZURE_TENANT_ID # Azure tenant ID (Option 2 — OAuth2)
AZURE_STORAGE_SAS_TOKEN # SAS token (Option 3)

Google Cloud Storage

GOOGLE_APPLICATION_CREDENTIALS # path to service-account JSON key file
GOOGLE_CLOUD_PROJECT # GCS project ID (optional)

Falls back to Application Default Credentials (Workload Identity, gcloud auth, …) when the variable is absent.

HDFS

HADOOP_CONF_DIR # directory containing core-site.xml and hdfs-site.xml
HADOOP_USER_NAME # Hadoop user name (optional)

Per-user Hadoop Overrides

Any *.xml file placed in ~/.softclient4es/ is loaded on top of the auto-detected configuration:

~/.softclient4es/s3a-override.xml

<configuration>
  <property>
    <name>fs.s3a.connection.maximum</name>
    <value>200</value>
  </property>
</configuration>
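The overlay semantics are: properties from an override file replace the corresponding keys of the auto-detected configuration. A pure-Python illustration of one file's overlay (the actual loader globs every *.xml in ~/.softclient4es/ and uses Hadoop's own Configuration class):

```python
import xml.etree.ElementTree as ET

def apply_override(conf: dict, xml_text: str) -> dict:
    """Overlay <property><name>/<value> pairs onto an existing configuration."""
    merged = dict(conf)
    for prop in ET.fromstring(xml_text).iter("property"):
        merged[prop.findtext("name")] = prop.findtext("value")
    return merged

override = """<configuration>
  <property>
    <name>fs.s3a.connection.maximum</name>
    <value>200</value>
  </property>
</configuration>"""

merged = apply_override({"fs.s3a.connection.maximum": "96"}, override)
# merged["fs.s3a.connection.maximum"] == "200"
```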

Version Compatibility

Feature      ES6  ES7  ES8  ES9
COPY INTO    Yes  Yes  Yes  Yes
JSON         Yes  Yes  Yes  Yes
JSON_ARRAY   Yes  Yes  Yes  Yes
PARQUET      Yes  Yes  Yes  Yes
DELTA_LAKE   Yes  Yes  Yes  Yes