The Attribute Extraction operation maps data from a source record into a predefined target schema using an AI language model (OpenAI / Azure OpenAI). It is particularly useful when source and target structures use different attribute names or when the mapping requires content-based interpretation.

Prerequisite: An OPENAI_PROVIDER with the use case Attribute Extraction must be configured.

1. Creating an Attribute Extraction

The operation is added to the flow like any other operation — drag it from the graphical flow editor onto the canvas and connect it to a data source.

2. Configuration

Basic Settings

Field	Description
Name	Label for the operation
Schema (Target Schema)	The schema that defines the target attributes with their keys, descriptions, and optional fixed values
Filter result to schema attributes	When enabled, only attributes defined in the target schema are kept in the result. Any additional attributes the AI may add are removed
Comment	Optional free text

Advanced

Field	Description
Preserve Source Attributes (Prefix)	If a prefix is provided, all source attributes are additionally included in the result record as `<prefix>_<attributeName>`
System Prompt	Optional Handlebars template that defines the AI’s role, the input format, and the output rules. Leave empty to use the built-in default (see below)
User Prompt	Optional Handlebars template that carries the per-record payload (target attributes + source record). Leave empty to use the built-in default (see below)

The split between system and user prompt exists for two reasons: it lets you override the rules without rewriting the per-record payload, and it lets the OpenAI prompt cache reuse the static prefix across records of the same run.

Target Format

A separate Target Format tab shows the JSON that is substituted into the {{targetFormat}} Handlebars variable. By default, this JSON is generated at runtime from the target schema. To freeze a fixed version, or to add or remove individual entries by hand, edit the JSON in the editor; the edited JSON is then stored on the operation. The button next to the editor regenerates the JSON from the current target schema and overwrites the editor content. Leave the editor empty to fall back to runtime generation.

When Target Format is set, Filter result to schema attributes uses the keys from the target-format JSON instead of the schema’s keys. Adding a key here lets it pass the filter; removing one strips it from the output.
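
The key selection described above can be sketched as follows. This is a minimal illustration, assuming the target-format JSON is an array of objects that each carry a `key` field (as the default system prompt describes). The function names `allowed_keys` and `filter_result` and the `mood` key are hypothetical and not part of the product.

```python
import json


def allowed_keys(target_format_json: str, schema_keys: list[str]) -> set[str]:
    """Return the keys that survive 'Filter result to schema attributes'."""
    if not target_format_json.strip():
        # Empty editor: fall back to the keys of the target schema.
        return set(schema_keys)
    entries = json.loads(target_format_json)
    return {entry["key"] for entry in entries}


def filter_result(result: dict, keys: set[str]) -> dict:
    """Drop every attribute in the AI result that is not an allowed key."""
    return {k: v for k, v in result.items() if k in keys}


# Hypothetical override: "age" was removed by hand, "mood" was added by hand.
override = '[{"key": "breed", "description": "breed of the cat"}, {"key": "mood", "description": "hypothetical extra key"}]'

keys = allowed_keys(override, schema_keys=["breed", "age"])
print(filter_result({"breed": "Siamese", "age": 2, "mood": "playful"}, keys))
```

With the override in place, `mood` passes the filter and `age` is stripped, even though the schema defines `age` and not `mood`.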

Source Attributes

Optional filter: only the selected attributes from the source record are passed to the AI. Leave empty to pass all attributes. Fewer attributes mean a shorter prompt, which usually improves both quality and speed.

How the AI uses the Target Schema

For each attribute in the target schema the following fields are passed to the AI:

key – the target attribute key (used as the property name in the output JSON)
description – name and description of the attribute combined (e.g. "Color — The main color of the product"). If name and description are identical or one is empty, only the available text is used
defaultValue – if set, passed to the AI as a hint
fixedValues – if set, the AI may only use values from this list
mandatory – if true, the AI must include the attribute in the output. If it cannot extract a value, it falls back to defaultValue if one is set, and otherwise to an empty string
pattern – if set, the AI must reformat the extracted value so that it matches this regular expression
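
As an illustration of the field list above, one target-format entry could be assembled as in the following sketch. The input field names (`name`, `description`, and so on) and the helper names are assumptions made for this example; the product's internal schema model is not documented here. The separator follows the "Color ... The main color of the product" example.

```python
SEPARATOR = " \u2014 "  # em dash with spaces, as in the description example above


def combine_text(name: str, description: str) -> str:
    """Combine name and description; use only one text if they are identical or one is empty."""
    name, description = name.strip(), description.strip()
    if not name or not description or name == description:
        return name or description
    return f"{name}{SEPARATOR}{description}"


def build_target_entry(attr: dict) -> dict:
    """Build the object passed to the AI for one schema attribute (hypothetical input shape)."""
    entry = {
        "key": attr["key"],
        "description": combine_text(attr.get("name", ""), attr.get("description", "")),
    }
    # Optional fields are only included when they are set.
    for optional in ("defaultValue", "fixedValues", "pattern"):
        if attr.get(optional):
            entry[optional] = attr[optional]
    if attr.get("mandatory"):
        entry["mandatory"] = True
    return entry


color = build_target_entry({
    "key": "color",
    "name": "Color",
    "description": "The main color of the product",
    "fixedValues": ["BLUE", "BLACK"],
})
print(color)
```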

3. Default Prompts

If no custom prompt is configured, the operation uses two complementary prompts. Each can be overridden independently — leave the corresponding field blank to fall back to the default below.

System Prompt (Role, Input Contract, Output Rules)

You are a precise data extraction assistant.
You extract structured attributes from a single source data record according to a list of target attributes.

You will receive a user message with two JSON blocks:
1. "Target attributes" — a JSON array describing what to extract.
2. "Source record" — a JSON object containing the raw data.

Each target attribute is an object with these fields:
- "key" (required): the property name to use in your output.
- "description" (required): what this attribute represents and what to look for in the source record.
- "fixedValues" (optional): a list of allowed values. If present, the output MUST be exactly one of these values, copied verbatim.
- "defaultValue" (optional): the value to use when the source record contains no usable information for this attribute.
- "mandatory" (optional, default false): when true, the attribute MUST appear in the output. If you cannot extract a value, fall back to "defaultValue" if given, otherwise use an empty string. Do not invent data.
- "pattern" (optional): a regular expression the output value must match. Reformat the extracted value if needed so it satisfies the pattern.

Output rules:
- Respond with a single flat JSON object — no markdown fences, no commentary, no wrapping array.
- Each property name in the output is a "key" from the target attributes.
- Map as many attributes as possible. Non-mandatory attributes with no source information may be omitted.
- Never fabricate values not supported by the source record. When in doubt, omit (or, for mandatory attributes, fall back as described above).

User Prompt (Per-record Payload, Handlebars Template)

Target attributes:
"""{{targetFormat}}"""

Source record:
"""{{sourceRecord}}"""

The variables {{targetFormat}} and {{sourceRecord}} are filled at runtime with the (condensed) schema data and the filtered source record, respectively. Both prompts are rendered with the same Handlebars context, so you can move variables between them as needed.

The minimal user prompt is intentional: keeping the static rules in the system message and only the per-record payload in the user message lets OpenAI’s prompt cache reuse the system prefix across all records of the same run.
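
To make the per-record payload concrete, the sketch below renders the default user prompt for one record. It uses plain string replacement as a stand-in for a real Handlebars engine, and the sample record, the `internalNote` attribute, and the exact JSON serialization are assumptions for illustration only.

```python
import json

DEFAULT_USER_PROMPT = (
    'Target attributes:\n"""{{targetFormat}}"""\n\n'
    'Source record:\n"""{{sourceRecord}}"""'
)


def select_source_attributes(record: dict, selected: list[str]) -> dict:
    """Keep only the selected attributes; an empty selection keeps everything."""
    if not selected:
        return dict(record)
    return {k: v for k, v in record.items() if k in selected}


def render(template: str, context: dict[str, str]) -> str:
    """Replace {{name}} placeholders (simplified stand-in for Handlebars)."""
    for name, value in context.items():
        template = template.replace("{{" + name + "}}", value)
    return template


# Hypothetical, shortened record and target format.
record = {"catName": "Shadow", "catDescription": "A stealthy Bombay cat.", "internalNote": "n/a"}
target_format = [{"key": "breed", "description": "the cat's breed"}]

context = {
    "targetFormat": json.dumps(target_format, separators=(",", ":")),
    "sourceRecord": json.dumps(select_source_attributes(record, ["catName", "catDescription"])),
}
user_prompt = render(DEFAULT_USER_PROMPT, context)
print(user_prompt)
```

Only the two selected source attributes reach the prompt. The system prompt contains no per-record data, so it stays identical across all records of a run.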

4. Preview

Before running the operation against the full dataset, you can verify your configuration with a Preview. The preview is triggered manually via the button next to the search bar at the top of the source pane — it does not run automatically when you open the expand view, because each AI call is slow and expensive.

When triggered:

The backend extracts attributes for only the first two records and returns the remaining records as untouched placeholders, so the preview grid keeps the same row count and hover-highlight still aligns with the source pane.
The preview always bypasses the cache of AI responses (see Notes), so it never reads stale answers and never poisons the cache with aborted runs.
Failed records (e.g. when the prompt is too large for the configured context window, or the provider returns an error) come back with an exclamation-mark indicator in the first column. Hovering it shows the error message.
Placeholder rows beyond the two-record limit show no validation indicator at all — they were not processed, so there is nothing to display.

The preview lets you catch issues before a real run: a wrong schema reference, a custom prompt that no longer fits the context window, an unreachable provider, or a targetFormat override with broken JSON.
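
The preview behavior described above can be sketched as follows. This is a simplified model, not the backend's implementation: the row shape (`status`, `record`, `error`) and the `fake_extract` function are hypothetical.

```python
from typing import Callable

PREVIEW_LIMIT = 2  # the preview extracts attributes for only the first two records


def preview(records: list[dict], extract: Callable[[dict], dict]) -> list[dict]:
    """Return one row per source record so the row count matches the source pane."""
    rows = []
    for index, record in enumerate(records):
        if index >= PREVIEW_LIMIT:
            # Not processed: untouched placeholder without a validation indicator.
            rows.append({"status": "placeholder", "record": record})
            continue
        try:
            rows.append({"status": "ok", "record": extract(record)})
        except Exception as error:
            # Failed records keep their error message for the hover text.
            rows.append({"status": "failed", "record": record, "error": str(error)})
    return rows


def fake_extract(record: dict) -> dict:
    """Hypothetical extraction that fails for oversized input."""
    if len(record["text"]) > 20:
        raise ValueError("prompt too large for the configured context window")
    return {"shortDescription": record["text"]}


rows = preview([{"text": "short"}, {"text": "x" * 50}, {"text": "third"}], fake_extract)
print([row["status"] for row in rows])
```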

5. Execution

After configuration, the operation is started via the Extract Attributes button. The AI processes each source record individually and produces an output record with the target attributes populated.

The execution card header shows the current GPT status of the configured provider:

Status	Meaning
available	The provider is ready
rate limited	The request quota is exhausted; reset in X seconds
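
How a sequential run might wait out a rate limit can be sketched as below. The `RateLimited` exception and the `extract` callback are hypothetical; the sketch only assumes that a rate-limited provider reports the number of seconds until the quota resets, as the status text above suggests.

```python
import time
from typing import Callable


class RateLimited(Exception):
    """Hypothetical error carrying the seconds until the quota resets."""

    def __init__(self, reset_in_seconds: float):
        super().__init__(f"rate limited; reset in {reset_in_seconds} seconds")
        self.reset_in_seconds = reset_in_seconds


def process_sequentially(records: list, extract: Callable, sleep: Callable = time.sleep) -> list:
    """Process one record at a time; on a rate limit, wait and retry the same record."""
    results = []
    for record in records:
        while True:
            try:
                results.append(extract(record))
                break
            except RateLimited as limit:
                sleep(limit.reset_in_seconds)
    return results
```

Passing `sleep` as a parameter keeps the loop testable without real waiting.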

6. Result

The output record contains the attributes defined by the target schema, populated with values extracted from the source record. Attributes for which no matching value was found may be omitted, empty, or set to the defaultValue.

If Preserve Source Attributes is configured, all original source attributes are additionally included in the result under the specified prefix.

If Filter result to schema attributes is enabled, only attributes whose keys are defined in the target schema (or in the Target Format JSON, if one is set) are kept in the result.
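
The prefix rule for Preserve Source Attributes can be illustrated with a short sketch. The function name and the prefix `src` are hypothetical; only the `<prefix>_<attributeName>` naming comes from the description of the setting.

```python
def add_source_attributes(result: dict, source: dict, prefix: str) -> dict:
    """Copy all source attributes into the result as <prefix>_<attributeName>."""
    merged = dict(result)
    if not prefix:
        # No prefix configured: source attributes are not preserved.
        return merged
    for name, value in source.items():
        merged[f"{prefix}_{name}"] = value
    return merged


extracted = {"breed": "Bombay", "color": "BLACK"}
source = {"catName": "Shadow"}
print(add_source_attributes(extracted, source, "src"))
```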

7. Example

The following example shows how unstructured cat descriptions can be mapped to a clean target schema using the AI.

Source data

Free-text descriptions of cats. Each record only contains the cat’s name and a single descriptive paragraph that mixes breed, color, age, body length and image URL in natural language.

Source dataset: Download JSON

catName	catDescription
Whisker	A playful Siamese with a medium body length of 43cm, cream-colored coat, and dark chocolate points. Aged 2 years. URL: https://res.cloudinary.com/.../cat01.png
Mittens	A cuddly Maine Coon, large(55cm) and fluffy with a gray tabby pattern and white paws. Aged 3 years. URL: https://res.cloudinary.com/.../cat02.png
Pumpkin	An affectionate Abyssinian with a sleek, medium-length cinnamon coat(around 410 mm give or take). Aged 4 years. URL: https://res.cloudinary.com/.../cat03.png
Shadow	A stealthy Bombay cat with a short, jet-black coat and bright yellow eyes. Aged 1 year. URL: https://res.cloudinary.com/.../cat04.png
Pearl	A dainty Turkish Angora with a long, silky white coat and heterochromatic eyes. Aged 6 years. Unfortunately we do not have an image for this cat.

Target schema

Before the operation can run, a schema must exist that defines the target attributes. The Attribute Extraction operation will reference this schema as its Schema (Target Schema).

Target schema definition: Download JSON

The schema contains the following attributes:

key	description	fixedValues	defaultValue
id	the unique id of the cat	—	—
breed	the cat’s breed	—	—
age	the age of the cat in years, a number between 0 and 20	—	—
color	the color of the cat’s fur	BLUE, BLACK, GRAY, WHITE, GINGER, BROWN	—
bodyLength	the length of the cat’s body in cm	—	—
image	the url of the cat’s image in jpeg format	—	https://picsum.photos/id/219/5000/3333
shortDescription	summary of the description, max 200 Characters	—	—

Notice how the schema does the heavy lifting:

color is restricted via fixedValues, so the AI must pick one of the allowed values (e.g. the source value “cream-colored” is normalized to WHITE).
bodyLength describes the unit (cm), so the AI converts “around 410 mm” to 41.
image provides a defaultValue, which is used when the source description does not contain a URL (e.g. for Pearl).
shortDescription instructs the AI to summarize the full description.

Expected result

After running the operation against the source data with the schema above, each output record contains the schema’s attributes filled from the natural-language description:

id	breed	age	color	bodyLength	image	shortDescription
Whisker	Siamese	2	WHITE	43	https://res.cloudinary.com/.../cat01.png	Playful Siamese, cream coat with dark chocolate points
Mittens	Maine Coon	3	GRAY	55	https://res.cloudinary.com/.../cat02.png	Cuddly Maine Coon, gray tabby with white paws
Pumpkin	Abyssinian	4	BROWN	41	https://res.cloudinary.com/.../cat03.png	Affectionate Abyssinian with cinnamon coat
Shadow	Bombay	1	BLACK	—	https://res.cloudinary.com/.../cat04.png	Stealthy Bombay with jet-black coat, yellow eyes
Pearl	Turkish Angora	6	WHITE	—	https://picsum.photos/id/219/5000/3333	Dainty Turkish Angora with silky white coat

Exact values may vary slightly between runs, since the result is produced by the AI model.
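
Because results can vary, a small client-side spot check of exported records may be useful. The sketch below is a hypothetical helper, not a product feature: it checks a record against `fixedValues`, `pattern`, and `mandatory`. The `pattern` and `mandatory` settings on `age` are invented for this example and are not part of the cat schema above.

```python
import re


def check_record(record: dict, target_attributes: list[dict]) -> list[str]:
    """Return a list of human-readable issues; an empty list means no issue was found."""
    issues = []
    for attr in target_attributes:
        key = attr["key"]
        value = record.get(key)
        if value is None or value == "":
            if attr.get("mandatory"):
                issues.append(f"{key}: mandatory attribute is missing")
            continue
        allowed = attr.get("fixedValues")
        if allowed and value not in allowed:
            issues.append(f"{key}: {value!r} is not one of the fixed values")
        pattern = attr.get("pattern")
        if pattern and not re.fullmatch(pattern, str(value)):
            issues.append(f"{key}: {value!r} does not match {pattern}")
    return issues


attributes = [
    {"key": "color", "fixedValues": ["BLUE", "BLACK", "GRAY", "WHITE", "GINGER", "BROWN"]},
    {"key": "age", "pattern": r"\d{1,2}", "mandatory": True},  # hypothetical constraints
]
print(check_record({"color": "WHITE", "age": 2}, attributes))  # prints []
print(check_record({"color": "cream", "age": ""}, attributes))
```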

8. Data Quality Index (DQI)

After an Attribute Extraction run, the result record list is scored with a Data Quality Index (DQI) — a single number from 0 to 100 that indicates how well the extracted records match the target schema. The DQI lets you compare runs after changing the model, the prompt, or the schema.

The score is shown in the Info Metrics card of the result and is also computed per attribute in the attribute metrics block.

What DQI measures

DQI combines two structural axes per attribute:

Fill rate — share of records where the attribute is filled. fillRate = 1 − nulls / totalRecords.
Conformity rate — how well the filled values match the schema’s constraints (currently required and value-list; more rules may be added in the future).

DQI per attribute:

dqi_a = fillRate_a · conformityRate_a

Overall DQI (across all schema attributes), scaled to 0–100:

_chioroDqi = 100 · Σ_a (weight_a · dqi_a) / Σ_a weight_a

weight_a is 2.0 if the attribute is mandatory in the schema, or if Filter result to schema attributes is on and the attribute has a value-list. Otherwise weight_a = 1.0. The weight does not compound: an attribute that meets both conditions still has weight 2.0.
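
The weighting and the overall formula can be written out as a short sketch. The function names and the sample per-attribute scores are illustrative; returning 0.0 for an empty attribute list is an assumption, since the behavior for that case is not described.

```python
def attribute_weight(mandatory: bool, has_value_list: bool, filter_enabled: bool) -> float:
    """2.0 if either condition holds (never more), otherwise 1.0."""
    if mandatory or (filter_enabled and has_value_list):
        return 2.0
    return 1.0


def overall_dqi(attributes: list[dict]) -> float:
    """Weighted mean of per-attribute DQI values (each 0..1), scaled to 0..100."""
    total_weight = sum(a["weight"] for a in attributes)
    if total_weight == 0:
        return 0.0  # assumption: no attributes means no score
    return 100 * sum(a["weight"] * a["dqi"] for a in attributes) / total_weight


attributes = [
    {"dqi": 0.729, "weight": attribute_weight(mandatory=True, has_value_list=False, filter_enabled=False)},
    {"dqi": 1.0, "weight": attribute_weight(mandatory=False, has_value_list=False, filter_enabled=True)},
]
print(round(overall_dqi(attributes), 1))  # prints 81.9
```

The mandatory attribute counts twice as much as the optional one, so its lower score pulls the overall DQI down further than a plain average would.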

How conformity is computed

Conformity is a product over per-rule scores, where each rule’s contribution is weighted by an exponent:

conformityRate_a = Π_r  score_r ^ ruleWeight_r

A higher rule weight makes that rule’s failures compound more sharply. Two rules are evaluated today, both with rule weight 2.0:

Rule	When it applies	Score
`required`	Attribute is mandatory in the schema	`filled / totalRecords` (nulls violate `required`)
`valueList`	Schema defines a value-list and at least one record is filled	`inListCount / filled`

If no rule applies (free text, no value-list, not mandatory), the conformity rate stays at 1.0.

Example. A mandatory attribute without a value-list, filled in 90% of records:

fillRate = 0.9
required rule score = 0.9, weight 2.0 → conformityRate = 0.9² = 0.81
dqi_a = 0.9 · 0.81 = 0.729
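
The same calculation can be reproduced in code. This is a sketch of the formulas above, not the product's implementation; it assumes that only `None` counts as a null and that both rules use the rule weight 2.0 stated earlier.

```python
from typing import Optional

RULE_WEIGHT = 2.0  # both rules evaluated today use this weight


def conformity_rate(rule_scores: dict[str, float]) -> float:
    """Product over all applicable rules of score ** ruleWeight; 1.0 if no rule applies."""
    rate = 1.0
    for score in rule_scores.values():
        rate *= score ** RULE_WEIGHT
    return rate


def attribute_dqi(values: list, mandatory: bool, value_list: Optional[list]) -> float:
    """dqi_a = fillRate_a * conformityRate_a for one attribute across all records."""
    total = len(values)
    if total == 0:
        return 0.0  # assumption: an empty result list scores 0
    filled = [v for v in values if v is not None]
    fill_rate = len(filled) / total  # same as 1 - nulls / totalRecords
    rules = {}
    if mandatory:
        rules["required"] = len(filled) / total
    if value_list and filled:
        rules["valueList"] = sum(v in value_list for v in filled) / len(filled)
    return fill_rate * conformity_rate(rules)


# Mandatory attribute without a value-list, filled in 9 of 10 records.
values = ["x"] * 9 + [None]
print(round(attribute_dqi(values, mandatory=True, value_list=None), 3))  # prints 0.729
```

For an optional attribute with a value-list, the `valueList` rule applies instead: with 9 of 10 records filled and 8 of those 9 in the list, the sketch yields 0.9 · (8/9)².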

What DQI is not

DQI is a structural quality heuristic. It does not judge whether the content of an extracted value is semantically correct — only whether the value is present and conforms to schema rules. A high DQI with a misleading prompt is still possible: the attributes look filled and in-list, but the values may still be wrong. Always spot-check a preview before trusting a high DQI.

Type, range, length, and pattern conformity are not yet evaluated. They may be added as additional rules later — the formula already accommodates them without changes to existing data.

9. Notes

AI processing is sequential per record. For large datasets with active rate limits, the operation automatically waits until the quota is available again.
If no OPENAI_PROVIDER with the use case Attribute Extraction is configured, the execution fails with a corresponding error message.
If no Schema (target schema) is configured, the execution also fails with an error.
The GPT status in the UI reflects the server-side observed state since the last backend start — after a restart the status is initially always shown as “available”.
AI responses are cached per provider. Identical prompts (same source record, same schema, same prompt text) return the cached result immediately without calling the API again.
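
The caching note can be illustrated with a small sketch. How the product actually derives its cache key is not documented here; hashing the provider identifier together with both prompt texts is an assumption, and `ResponseCache` and `call_api` are hypothetical names.

```python
import hashlib
import json
from typing import Callable


def cache_key(provider_id: str, system_prompt: str, user_prompt: str) -> str:
    """Identical provider and prompt texts always map to the same key."""
    payload = json.dumps([provider_id, system_prompt, user_prompt], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResponseCache:
    """In-memory cache that calls the API only for prompts it has not seen."""

    def __init__(self) -> None:
        self._entries: dict[str, str] = {}
        self.api_calls = 0

    def get_or_call(self, provider_id: str, system_prompt: str, user_prompt: str,
                    call_api: Callable[[str, str], str]) -> str:
        key = cache_key(provider_id, system_prompt, user_prompt)
        if key not in self._entries:
            self.api_calls += 1
            self._entries[key] = call_api(system_prompt, user_prompt)
        return self._entries[key]
```

Because the rendered user prompt already contains the source record and the target format, a change to the record, the schema, or the prompt text changes the key and leads to a fresh API call.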
