The Attribute Extraction operation maps data from a source record into a predefined target schema using an AI language model (OpenAI / Azure OpenAI). It is particularly useful when source and target structures use different attribute names or when the mapping requires content-based interpretation.
Prerequisite: An OPENAI_PROVIDER with the use case Attribute Extraction must be configured.
1. Creating an Attribute Extraction
The operation is added to the flow like any other operation: in the graphical flow editor, drag it onto the canvas and connect it to a data source.
2. Configuration
Basic Settings
| Field | Description |
|---|---|
| Name | Label for the operation |
| Schema (Target Schema) | The schema that defines the target attributes with their keys, descriptions, and optional fixed values |
| Filter result to schema attributes | When enabled, only attributes defined in the target schema are kept in the result. Any additional attributes the AI may add are removed |
| Comment | Optional free text |
Advanced
| Field | Description |
|---|---|
| Preserve Source Attributes (Prefix) | If a prefix is provided, all source attributes are additionally included in the result record as <prefix>_<attributeName> |
| System Prompt | Optional Handlebars template that defines the AI’s role, the input format, and the output rules. Leave empty to use the built-in default (see below) |
| User Prompt | Optional Handlebars template that carries the per-record payload (target attributes + source record). Leave empty to use the built-in default (see below) |
The split between system and user prompt exists for two reasons: it lets you override the rules without rewriting the per-record payload, and it lets the OpenAI prompt cache reuse the static prefix across records of the same run.
Target Format
A separate Target Format tab shows the JSON that is substituted into the {{targetFormat}} Handlebars variable. By default it is generated at runtime from the target schema; if you want a fixed, frozen version (or want to add or remove individual entries by hand), edit the JSON in the editor and it is stored on the operation. The button next to the editor regenerates the JSON from the current target schema, overwriting the editor content. Leave the editor empty to fall back to runtime generation.
When Target Format is set, Filter result to schema attributes uses the keys from the target-format JSON (not the schema’s keys) — so adding a key here lets it pass the filter, removing one strips it from the output.
Source Attributes
Optional filter: only the selected attributes from the source record are passed to the AI. Leave empty to pass all attributes. Fewer attributes mean a shorter prompt, which usually improves both extraction quality and speed.
How the AI uses the Target Schema
For each attribute in the target schema the following fields are passed to the AI:
- key: the target attribute key (used as the property name in the output JSON)
- description: name and description of the attribute combined (e.g. "Color — The main color of the product"). If name and description are identical or one is empty, only the available text is used
- defaultValue: if set, passed to the AI as a hint
- fixedValues: if set, the AI may only use values from this list
- mandatory: if true, the AI must include the attribute in the output. If it cannot extract a value, it falls back to defaultValue, otherwise an empty string
- pattern: if set, the AI must reformat the extracted value so that it matches this regular expression
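The condensing step described above can be sketched in Python. This is an illustration only, not the operation's actual code: the function name `build_target_entry`, the input field `name`, and the exact separator between name and description (taken from the "Color" example) are assumptions.

```python
# Assumed separator (space, U+2014 dash, space), copied from the "Color" example.
SEPARATOR = " \u2014 "


def build_target_entry(attr: dict) -> dict:
    """Condense one schema attribute into the fields passed to the AI (hypothetical helper)."""
    name = (attr.get("name") or "").strip()
    description = (attr.get("description") or "").strip()

    # Combine name and description unless they are identical or one is empty.
    if name and description and name != description:
        text = name + SEPARATOR + description
    else:
        text = name or description

    entry = {"key": attr["key"], "description": text}

    # Optional fields are only included when they are set in the schema.
    for field in ("defaultValue", "fixedValues", "pattern"):
        if attr.get(field):
            entry[field] = attr[field]
    if attr.get("mandatory"):
        entry["mandatory"] = True
    return entry


color = build_target_entry({
    "key": "color",
    "name": "Color",
    "description": "The main color of the product",
    "fixedValues": ["BLUE", "BLACK", "WHITE"],
})
```

Here `color` carries `key`, the combined `description`, and `fixedValues`; the unset optional fields are left out, which keeps the prompt short.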
3. Default Prompts
If no custom prompt is configured, the operation uses two complementary prompts. Each can be overridden independently — leave the corresponding field blank to fall back to the default below.
System Prompt (Role, Input Contract, Output Rules)
You are a precise data extraction assistant.
You extract structured attributes from a single source data record according to a list of target attributes.
You will receive a user message with two JSON blocks:
1. "Target attributes" — a JSON array describing what to extract.
2. "Source record" — a JSON object containing the raw data.
Each target attribute is an object with these fields:
- "key" (required): the property name to use in your output.
- "description" (required): what this attribute represents and what to look for in the source record.
- "fixedValues" (optional): a list of allowed values. If present, the output MUST be exactly one of these values, copied verbatim.
- "defaultValue" (optional): the value to use when the source record contains no usable information for this attribute.
- "mandatory" (optional, default false): when true, the attribute MUST appear in the output. If you cannot extract a value, fall back to "defaultValue" if given, otherwise use an empty string. Do not invent data.
- "pattern" (optional): a regular expression the output value must match. Reformat the extracted value if needed so it satisfies the pattern.
Output rules:
- Respond with a single flat JSON object — no markdown fences, no commentary, no wrapping array.
- Each property name in the output is a "key" from the target attributes.
- Map as many attributes as possible. Non-mandatory attributes with no source information may be omitted.
- Never fabricate values not supported by the source record. When in doubt, omit (or, for mandatory attributes, fall back as described above).
User Prompt (Per-record Payload, Handlebars Template)
Target attributes:
"""{{targetFormat}}"""
Source record:
"""{{sourceRecord}}"""
The variables {{targetFormat}} and {{sourceRecord}} are filled at runtime with the (condensed) schema data and the filtered source record respectively. Both prompts go through the same Handlebars context, so you can move variables between them as needed.
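To make the substitution concrete, the following Python sketch fills the default user prompt with two JSON strings. It is a plain placeholder replacement used as a stand-in, not a Handlebars implementation (no helpers, blocks, or escaping), and the `render` function and sample values are hypothetical.

```python
import json
import re

# Default user prompt as quoted in this section.
DEFAULT_USER_PROMPT = (
    'Target attributes:\n"""{{targetFormat}}"""\n'
    'Source record:\n"""{{sourceRecord}}"""'
)


def render(template: str, context: dict) -> str:
    """Replace {{name}} placeholders; unknown names become empty strings."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda match: str(context.get(match.group(1), "")),
        template,
    )


# Both variables are assumed to arrive as JSON text.
context = {
    "targetFormat": json.dumps([{"key": "breed", "description": "the cat's breed"}]),
    "sourceRecord": json.dumps({"catName": "Shadow", "catDescription": "A stealthy Bombay cat."}),
}
user_prompt = render(DEFAULT_USER_PROMPT, context)
```

Because both prompts share one context, the same `render` call would work for a custom system prompt that references `{{targetFormat}}`.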
The minimal user prompt is intentional: keeping the static rules in the system message and only the per-record payload in the user message lets OpenAI’s prompt cache reuse the system prefix across all records of the same run.
4. Preview
Before running the operation against the full dataset, you can verify your configuration with a Preview. The preview is triggered manually via the button next to the search bar at the top of the source pane — it does not run automatically when you open the expand view, because each AI call is slow and expensive.
When triggered:
- The backend extracts attributes for only the first two records and returns the remaining records as untouched placeholders, so the preview grid keeps the same row count and hover-highlight still aligns with the source pane.
- The preview always bypasses the prompt cache, so it never reads stale answers and never poisons the cache with aborted runs.
- Failed records (e.g. when the prompt is too large for the configured context window, or the provider returns an error) come back with an exclamation-mark indicator in the first column. Hovering it shows the error message.
- Placeholder rows beyond the two-record limit show no validation indicator at all — they were not processed, so there is nothing to display.
The preview lets you catch issues before a real run: a wrong schema reference, a custom prompt that no longer fits the context window, an unreachable provider, or a targetFormat override with broken JSON.
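The preview behavior described above can be pictured with a small Python sketch. The `preview` function, the row layout, and the `extract` callable are hypothetical; only the two-record limit, the untouched placeholders, and the per-record error reporting come from this section.

```python
PREVIEW_LIMIT = 2  # the preview processes only the first two records


def preview(records, extract):
    """Return one row per source record so the grid keeps its row count."""
    rows = []
    for index, record in enumerate(records):
        if index >= PREVIEW_LIMIT:
            # Not processed: returned untouched, without a validation indicator.
            rows.append({"status": "placeholder", "record": record})
            continue
        try:
            rows.append({"status": "ok", "record": extract(record)})
        except Exception as error:
            # Failed records keep the error message for the hover tooltip.
            rows.append({"status": "failed", "record": record, "error": str(error)})
    return rows
```

A failing record does not abort the preview; the remaining record within the limit is still attempted.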
5. Execution
After configuration, the operation is started via the Extract Attributes button. The AI processes each source record individually and produces an output record with the target attributes populated.
The execution card header shows the current GPT status of the configured provider:
| Status | Meaning |
|---|---|
| available | The provider is ready |
| rate limited | The request quota is exhausted; reset in X seconds |
6. Result
The output record contains the attributes defined by the target schema, populated with values extracted from the source record. Attributes for which no matching value was found may be empty or set to the defaultValue.
If Preserve Source Attributes is configured, all original source attributes are additionally included in the result under the specified prefix.
If Filter result to schema attributes is enabled, only attributes whose keys are defined in the target schema are kept in the result.
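How the two options shape the result record can be sketched as follows. This is an assumption-laden illustration: the function `build_result` is hypothetical, and the order (filter the AI output first, then append the prefixed source attributes) is assumed rather than documented. `allowed_keys` stands for the schema keys, or the target-format keys when a Target Format is stored.

```python
def build_result(ai_output, source, allowed_keys, filter_to_schema=False, prefix=""):
    """Combine the AI output with optional filtering and preserved source attributes."""
    result = dict(ai_output)

    # Filter result to schema attributes: drop keys the AI added on its own.
    if filter_to_schema:
        result = {key: value for key, value in result.items() if key in allowed_keys}

    # Preserve Source Attributes: copy every source attribute as <prefix>_<attributeName>.
    # Assumption: preserved attributes are added after filtering and are not filtered out.
    if prefix:
        for name, value in source.items():
            result[f"{prefix}_{name}"] = value
    return result


record = build_result(
    ai_output={"id": "Pearl", "breed": "Turkish Angora", "mood": "calm"},
    source={"catName": "Pearl"},
    allowed_keys={"id", "breed"},
    filter_to_schema=True,
    prefix="src",
)
```

In this example `mood` is removed because it is not an allowed key, and the source attribute appears as `src_catName`.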
7. Example
The following example shows how unstructured cat descriptions can be mapped to a clean target schema using the AI.
Source data
Free-text descriptions of cats. Each record only contains the cat’s name and a single descriptive paragraph that mixes breed, color, age, body length and image URL in natural language.
Source dataset: Download JSON
| catName | catDescription |
|---|---|
| Whisker | A playful Siamese with a medium body length of 43cm, cream-colored coat, and dark chocolate points. Aged 2 years. URL: https://res.cloudinary.com/.../cat01.png |
| Mittens | A cuddly Maine Coon, large(55cm) and fluffy with a gray tabby pattern and white paws. Aged 3 years. URL: https://res.cloudinary.com/.../cat02.png |
| Pumpkin | An affectionate Abyssinian with a sleek, medium-length cinnamon coat(around 410 mm give or take). Aged 4 years. URL: https://res.cloudinary.com/.../cat03.png |
| Shadow | A stealthy Bombay cat with a short, jet-black coat and bright yellow eyes. Aged 1 year. URL: https://res.cloudinary.com/.../cat04.png |
| Pearl | A dainty Turkish Angora with a long, silky white coat and heterochromatic eyes. Aged 6 years. Unfortunately we do not have an image for this cat. |
Target schema
Before the operation can run, a schema must exist that defines the target attributes. The Attribute Extraction operation will reference this schema as its Schema (Target Schema).
Target schema definition: Download JSON
The schema contains the following attributes:
| key | description | fixedValues | defaultValue |
|---|---|---|---|
| id | the unique id of the cat | — | — |
| breed | the cat’s breed | — | — |
| age | the age of the cat in years, a number between 0 and 20 | — | — |
| color | the color of the cat’s fur | BLUE, BLACK, GRAY, WHITE, GINGER, BROWN | — |
| bodyLength | the length of the cat’s body in cm | — | — |
| image | the url of the cat’s image in jpeg format | — | https://picsum.photos/id/219/5000/3333 |
| shortDescription | summary of the description, max 200 Characters | — | — |
Notice how the schema does the heavy lifting:
- color is restricted via fixedValues, so the AI must pick one of the allowed values (e.g. the source value “cream-colored” is normalized to WHITE).
- bodyLength describes the unit (cm), so the AI converts “around 410 mm” to 41.
- image provides a defaultValue, which is used when the source description does not contain a URL (e.g. for Pearl).
- shortDescription instructs the AI to summarize the full description.
Expected result
After running the operation against the source data with the schema above, each output record contains the schema’s attributes filled from the natural-language description:
| id | breed | age | color | bodyLength | image | shortDescription |
|---|---|---|---|---|---|---|
| Whisker | Siamese | 2 | WHITE | 43 | https://res.cloudinary.com/.../cat01.png | Playful Siamese, cream coat with dark chocolate points |
| Mittens | Maine Coon | 3 | GRAY | 55 | https://res.cloudinary.com/.../cat02.png | Cuddly Maine Coon, gray tabby with white paws |
| Pumpkin | Abyssinian | 4 | BROWN | 41 | https://res.cloudinary.com/.../cat03.png | Affectionate Abyssinian with cinnamon coat |
| Shadow | Bombay | 1 | BLACK | — | https://res.cloudinary.com/.../cat04.png | Stealthy Bombay with jet-black coat, yellow eyes |
| Pearl | Turkish Angora | 6 | WHITE | — | https://picsum.photos/id/219/5000/3333 | Dainty Turkish Angora with silky white coat |
Exact values may vary slightly between runs, since the result is produced by the AI model.
8. Data Quality Index (DQI)
After an Attribute Extraction run, the result record list is scored with a Data Quality Index (DQI) — a single number from 0 to 100 that indicates how well the extracted records match the target schema. The DQI lets you compare runs after changing the model, the prompt, or the schema.
The score is shown in the Info Metrics card of the result and is also computed per attribute in the attribute metrics block.
What DQI measures
DQI combines two structural axes per attribute:
- Fill rate: share of records where the attribute is filled. fillRate = 1 − nulls / totalRecords.
- Conformity rate: how well the filled values match the schema’s constraints (currently required and value-list; more rules may be added in the future).
DQI per attribute:
dqi_a = fillRate_a · conformityRate_a
Overall DQI (across all schema attributes), scaled to 0–100:
_chioroDqi = 100 · Σ_a (weight_a · dqi_a) / Σ_a weight_a
weight_a is 2.0 if the attribute is mandatory in the schema, or if Filter result to schema attributes is on and the attribute has a value-list. Otherwise weight_a = 1.0. The weight does not compound past 2.0.
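The weighting and the overall score can be written out in a few lines of Python. This is a sketch of the formulas above, not the product's code; the function names and the behavior for an empty attribute list (returning 0.0) are assumptions.

```python
def attribute_weight(mandatory: bool, has_value_list: bool, filter_to_schema: bool) -> float:
    """Weight is 2.0 for important attributes and never compounds past 2.0."""
    if mandatory or (filter_to_schema and has_value_list):
        return 2.0
    return 1.0


def overall_dqi(attributes) -> float:
    """Weighted mean of the per-attribute DQI values, scaled to 0-100."""
    total_weight = sum(attr["weight"] for attr in attributes)
    if total_weight == 0:
        return 0.0  # assumption: no attributes means no score
    weighted_sum = sum(attr["weight"] * attr["dqi"] for attr in attributes)
    return 100 * weighted_sum / total_weight


attributes = [
    {"key": "id", "dqi": 0.729, "weight": attribute_weight(True, False, False)},
    {"key": "breed", "dqi": 1.0, "weight": attribute_weight(False, False, False)},
]
score = overall_dqi(attributes)  # 100 * (2.0 * 0.729 + 1.0 * 1.0) / 3.0, roughly 81.93
```

Because the mandatory attribute counts double, its lower per-attribute score pulls the overall value down more than an optional attribute with the same score would.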
How conformity is computed
Conformity is a product over per-rule scores, where each rule’s contribution is weighted by an exponent:
conformityRate_a = Π_r score_r ^ ruleWeight_r
A higher rule weight makes that rule’s failures compound more sharply. Two rules are evaluated today, both with rule weight 2.0:
| Rule | When it applies | Score |
|---|---|---|
| required | Attribute is mandatory in the schema | filled / totalRecords (nulls violate required) |
| valueList | Schema defines a value-list, at least one record is filled | inListCount / filled |
If no rule applies (free text, no value-list, not mandatory), the conformity rate stays at 1.0.
Example. A mandatory attribute without a value-list, filled in 90% of records:
- fillRate = 0.9
- required rule score = 0.9, weight 2.0 → conformityRate = 0.9² = 0.81
- dqi_a = 0.9 · 0.81 = 0.729
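The same calculation can be reproduced in Python. The sketch follows the rule table above; the function names and parameters are hypothetical, and the rule weight of 2.0 is the value stated for both current rules.

```python
RULE_WEIGHT = 2.0  # both rules currently use this exponent


def conformity_rate(total, filled, in_list=None, mandatory=False):
    """Product of per-rule scores, each raised to its rule weight."""
    rate = 1.0
    # required: applies to mandatory attributes; nulls violate the rule.
    if mandatory and total > 0:
        rate *= (filled / total) ** RULE_WEIGHT
    # valueList: applies when a value-list exists and at least one record is filled.
    if in_list is not None and filled > 0:
        rate *= (in_list / filled) ** RULE_WEIGHT
    return rate


def attribute_dqi(total, nulls, in_list=None, mandatory=False):
    """dqi_a = fillRate_a * conformityRate_a for a single attribute."""
    filled = total - nulls
    fill_rate = 1 - nulls / total
    return fill_rate * conformity_rate(total, filled, in_list, mandatory)


# Mandatory attribute without a value-list, filled in 9 of 10 records.
example = attribute_dqi(total=10, nulls=1, mandatory=True)  # approximately 0.729
```

An optional free-text attribute with the same fill rate would score 0.9, since no rule applies and the conformity rate stays at 1.0.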
What DQI is not
DQI is a structural quality heuristic. It does not judge whether the content of an extracted value is semantically correct — only whether the value is present and conforms to schema rules. A high DQI with a misleading prompt is still possible: the attributes look filled and in-list, but the values may still be wrong. Always spot-check a preview before trusting a high DQI.
Type, range, length, and pattern conformity are not yet evaluated. They may be added as additional rules later — the formula already accommodates them without changes to existing data.
9. Notes
- AI processing is sequential per record. For large datasets with active rate limits, the operation automatically waits until the quota is available again.
- If no OPENAI_PROVIDER with the use case Attribute Extraction is configured, the execution fails with a corresponding error message.
- If no Schema (target schema) is configured, the execution also fails with an error.
- The GPT status in the UI reflects the server-side observed state since the last backend start — after a restart the status is initially always shown as “available”.
- AI responses are cached per provider. Identical prompts (same source record, same schema, same prompt text) return the cached result immediately without calling the API again.
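The caching note can be illustrated with a minimal in-memory sketch. How the operation actually derives its cache key is not documented here; hashing the provider together with both prompt texts is an assumption, and `cache_key`, `PromptCache`, and `call_model` are hypothetical names.

```python
import hashlib


def cache_key(provider_id: str, system_prompt: str, user_prompt: str) -> str:
    """Derive a key from the provider and the full prompt text (assumed scheme)."""
    digest = hashlib.sha256()
    for part in (provider_id, system_prompt, user_prompt):
        data = part.encode("utf-8")
        # A length prefix keeps ("ab", "c") distinct from ("a", "bc").
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


class PromptCache:
    """In-memory stand-in for a per-provider response cache."""

    def __init__(self):
        self._entries = {}

    def get_or_call(self, provider_id, system_prompt, user_prompt, call_model):
        key = cache_key(provider_id, system_prompt, user_prompt)
        if key not in self._entries:
            # Only identical prompts for the same provider skip the API call.
            self._entries[key] = call_model(system_prompt, user_prompt)
        return self._entries[key]
```

Since the rendered prompt already contains the source record and the schema data, any change to the record, the schema, or the prompt text produces a different key and therefore a fresh API call.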