# COPY INTO — Bulk Loading
COPY INTO loads documents from external files into an Elasticsearch index using the Bulk API.
## Syntax

```sql
COPY INTO table_name
FROM 'path/to/file'
[FILE_FORMAT = 'JSON' | 'JSON_ARRAY' | 'PARQUET' | 'DELTA_LAKE']
[ON CONFLICT (pk_column) DO UPDATE];
```

## Behavior
COPY INTO performs:

- Index name validation
- Loading of the real Elasticsearch schema (mapping, primary key, partitioning)
- Primary key extraction (`_id` generation, composite PK concatenation)
- Partitioning extraction (index name suffixed based on date)
- Bulk ingestion via the Bulk API
- Pipeline execution
- Returns `DmlResult` with `inserted` and `rejected` counts
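One of the steps above, concatenating composite primary-key values into a single `_id`, can be sketched as follows. The join separator (`-`) is an assumption for illustration, not the tool's documented scheme:

```shell
# Hypothetical sketch: derive an Elasticsearch _id from primary-key column values.
# Joining with "-" is an assumption; the real concatenation scheme may differ.
make_id() {
  local IFS='-'       # join all arguments with a dash
  printf '%s' "$*"
}

make_id "A12"           # single-column PK: _id is the value itself
echo
make_id "A12" "2024-01" # composite PK: values concatenated
echo
```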
## Supported File Formats

| Format | Description |
|---|---|
| `JSON` | Newline-delimited JSON (one document per line) |
| `JSON_ARRAY` | JSON array of documents |
| `PARQUET` | Apache Parquet columnar format |
| `DELTA_LAKE` | Delta Lake table format |
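To make the difference between the two JSON variants concrete, here is how the same two documents would be laid out in each format (file names and contents are illustrative):

```shell
dir="$(mktemp -d)"

# JSON: newline-delimited, one complete document per line
printf '%s\n' \
  '{"uuid": "A12", "name": "Homer Simpson"}' \
  '{"uuid": "A14", "name": "Moe Szyslak"}' > "$dir/people.json"

# JSON_ARRAY: a single JSON array containing all documents
printf '[%s, %s]\n' \
  '{"uuid": "A12", "name": "Homer Simpson"}' \
  '{"uuid": "A14", "name": "Moe Szyslak"}' > "$dir/people_array.json"

cat "$dir/people.json"
cat "$dir/people_array.json"
```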
## Full Example

### Table Definition

```sql
CREATE TABLE IF NOT EXISTS copy_into_test (
    uuid KEYWORD NOT NULL,
    name VARCHAR,
    birthDate DATE,
    childrenCount INT,
    PRIMARY KEY (uuid)
);
```

### Data File (example_data.json)

```json
{"uuid": "A12", "name": "Homer Simpson", "birthDate": "1967-11-21", "childrenCount": 0}
{"uuid": "A14", "name": "Moe Szyslak", "birthDate": "1967-11-21", "childrenCount": 0}
{"uuid": "A16", "name": "Barney Gumble", "birthDate": "1969-05-09", "childrenCount": 2}
```

### COPY INTO Statement

```sql
COPY INTO copy_into_test
FROM 's3://my-bucket/path/to/example_data.json'
FILE_FORMAT = 'JSON'
ON CONFLICT (uuid) DO UPDATE;
```

- PK = `uuid` → `_id = uuid`
- `ON CONFLICT DO UPDATE` → Bulk upsert
- Table pipeline is applied
- Returns `DmlResult` (inserted = 3, rejected = 0)
## Remote File System Support
COPY INTO transparently supports remote file systems by auto-detecting the URI scheme. No SQL syntax change is required.
| URI scheme | File system | Required JAR |
|---|---|---|
| `s3a://` or `s3://` | AWS S3 | `hadoop-aws` |
| `abfs://`, `abfss://`, `wasb://`, `wasbs://` | Azure ADLS Gen2 / Blob Storage | `hadoop-azure` |
| `gs://` | Google Cloud Storage | `gcs-connector-hadoop3` |
| `hdfs://` | HDFS | (bundled with `hadoop-client`) |
| (local path) | Local filesystem | (no extra JAR needed) |
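For example, targeting an S3-compatible store such as MinIO needs no SQL change, only the AWS environment variables described under Credentials Configuration. Every value below is a placeholder:

```shell
# Placeholder credentials for an S3-compatible store (e.g. MinIO).
# All values are illustrative; substitute your own.
export AWS_ACCESS_KEY_ID="minioadmin"
export AWS_SECRET_ACCESS_KEY="minioadmin"
export AWS_REGION="us-east-1"
export AWS_ENDPOINT_URL="http://localhost:9000"   # custom endpoint (MinIO, LocalStack, ...)

# The COPY INTO statement itself is unchanged:
#   COPY INTO my_table FROM 's3://my-bucket/data.json' FILE_FORMAT = 'JSON';
```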
## Credentials Configuration
Authentication is resolved automatically from standard environment variables.
### AWS S3

```
AWS_ACCESS_KEY_ID       # access key (falls back to DefaultAWSCredentialsProviderChain)
AWS_SECRET_ACCESS_KEY   # secret key
AWS_SESSION_TOKEN       # session token (optional)
AWS_REGION              # region (or AWS_DEFAULT_REGION)
AWS_ENDPOINT_URL        # custom endpoint for S3-compatible stores (MinIO, LocalStack, ...)
```

### Azure ADLS Gen2 / Blob Storage

```
AZURE_STORAGE_ACCOUNT_NAME  # storage account name
AZURE_STORAGE_ACCOUNT_KEY   # shared key (Option 1)
AZURE_CLIENT_ID             # service principal client ID (Option 2 — OAuth2)
AZURE_CLIENT_SECRET         # service principal secret (Option 2 — OAuth2)
AZURE_TENANT_ID             # Azure tenant ID (Option 2 — OAuth2)
AZURE_STORAGE_SAS_TOKEN     # SAS token (Option 3)
```

### Google Cloud Storage

```
GOOGLE_APPLICATION_CREDENTIALS  # path to service-account JSON key file
GOOGLE_CLOUD_PROJECT            # GCS project ID (optional)
```

Falls back to Application Default Credentials (Workload Identity, gcloud auth, …) when the variable is absent.

### HDFS

```
HADOOP_CONF_DIR    # directory containing core-site.xml and hdfs-site.xml
HADOOP_USER_NAME   # Hadoop user name (optional)
```

### Per-user Hadoop Overrides
Any `*.xml` file placed in `~/.softclient4es/` is loaded on top of the auto-detected configuration:

```xml
<configuration>
  <property>
    <name>fs.s3a.connection.maximum</name>
    <value>200</value>
  </property>
</configuration>
```

## Version Compatibility
| Feature | ES6 | ES7 | ES8 | ES9 |
|---|---|---|---|---|
| COPY INTO | Yes | Yes | Yes | Yes |
| `JSON` | Yes | Yes | Yes | Yes |
| `JSON_ARRAY` | Yes | Yes | Yes | Yes |
| `PARQUET` | Yes | Yes | Yes | Yes |
| `DELTA_LAKE` | Yes | Yes | Yes | Yes |