DFS Pro Fusion Tasks
Fusion tasks combine data from multiple datasets. Use them when one workflow needs to match, merge, compare, enrich, or reconcile records from different sources.
Examples:
- combine sensor history with maintenance history;
- match inspection findings to work orders;
- align source-system asset IDs with FactVerse assets;
- merge operational records from several sites;
- prepare a reviewed output dataset for predictive maintenance or AI Agent workflows.
Prerequisites
Before creating a fusion task, confirm:
- input datasets exist and are accessible;
- each input dataset has a known steward or source owner;
- key fields, timestamp fields, and identity fields are understood;
- expected output dataset name and owner are defined;
- matching mode and review threshold are agreed;
- reviewers are available when the task can produce conflicts or low-confidence results.
Fusion task flow
Open Data Fusion
Go to:
Data Integration > Data Fusion
The page shows fusion tasks, mode, status, output dataset, and run actions.
Fusion modes
| Mode | Use when |
|---|---|
| Rule Matching | Matching logic is deterministic, such as same asset ID, same timestamp window, or known key columns. |
| Semantic Matching | Names, aliases, descriptions, or relationships need to be compared. |
| LLM Assisted | The task needs language-based assistance and every uncertain result will be reviewed. |
Use rule matching first when stable keys are available. Use semantic or LLM-assisted modes when source records use different names, aliases, or descriptions.
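As an illustration of deterministic matching, the sketch below pairs records that share an asset ID and fall inside a timestamp window. It is not DFS Pro's implementation: the function name `rule_match`, the field names (`id`, `asset_id`, `ts`), and the one-hour window are all assumptions chosen for the example.

```python
from datetime import datetime, timedelta

def rule_match(sensor_rows, work_orders, window=timedelta(hours=1)):
    """Pair rows that share an asset ID and fall within a timestamp window."""
    # Index work orders by asset ID so each sensor row only scans candidates.
    by_asset = {}
    for wo in work_orders:
        by_asset.setdefault(wo["asset_id"], []).append(wo)

    matches = []
    for row in sensor_rows:
        for wo in by_asset.get(row["asset_id"], []):
            # Deterministic rule: same key and timestamps within the window.
            if abs(row["ts"] - wo["ts"]) <= window:
                matches.append((row["id"], wo["id"]))
    return matches

# Hypothetical sample data.
sensor_rows = [
    {"id": "s1", "asset_id": "PUMP-01", "ts": datetime(2024, 1, 1, 10, 0)},
    {"id": "s2", "asset_id": "PUMP-02", "ts": datetime(2024, 1, 1, 10, 0)},
]
work_orders = [
    {"id": "w1", "asset_id": "PUMP-01", "ts": datetime(2024, 1, 1, 10, 30)},
    {"id": "w2", "asset_id": "PUMP-02", "ts": datetime(2024, 1, 1, 14, 0)},
]
print(rule_match(sensor_rows, work_orders))  # [('s1', 'w1')]
```

Only the first pair matches, because the second pair shares an asset ID but lies four hours apart. When no such stable key exists, this kind of rule has nothing to anchor on, which is where the semantic or LLM-assisted modes come in.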
Large-source fusion controls
Large operational datasets can be fused through background execution instead of returning the full result to the browser. For supported methods such as merge_by_natural_key, DFS Pro can run asynchronously, stream source records in chunks, and persist output rows directly into the target dataset.
In the UI this still appears as one fusion run. The run may stay in queued or running status while source rows are processed. Review the run history after completion for totals, conflict counts, persisted row counts, and any error message.
Use this path when:
- source tables are too large for preview-style execution;
- the output should be written to a governed dataset;
- reviewers need run history and conflict counts rather than a downloaded result file;
- the same task will be scheduled or rerun after source refresh.
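The chunked, persist-as-you-go pattern described above can be sketched in a few lines. This is a generic illustration, not DFS Pro internals: `iter_chunks`, `stream_to_target`, the `persist_chunk` callback, and the chunk size of 1000 are hypothetical names and values.

```python
from itertools import islice

def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

def stream_to_target(source_rows, persist_chunk, chunk_size=1000):
    """Process rows chunk by chunk and keep run-history style totals."""
    totals = {"total": 0, "persisted": 0, "chunks": 0}
    for chunk in iter_chunks(source_rows, chunk_size):
        totals["total"] += len(chunk)
        # persist_chunk writes the rows and returns how many were stored.
        totals["persisted"] += persist_chunk(chunk)
        totals["chunks"] += 1
    return totals

# Hypothetical in-memory stand-in for the target dataset.
target = []

def persist_chunk(chunk):
    target.extend(chunk)
    return len(chunk)

# A generator keeps the source lazy, so only one chunk is held at a time.
source = ({"asset_id": f"A-{i}"} for i in range(2500))
print(stream_to_target(source, persist_chunk))
# {'total': 2500, 'persisted': 2500, 'chunks': 3}
```

Because only one chunk is in memory at a time, the full result never has to be assembled or transferred in one piece, and the totals remain available for review afterwards.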
Source row filters
Some sources contain rows outside the scope of a specific fusion task. A fusion method can carry source_row_filters in its method configuration so the run keeps only the intended source slice before matching.
Example:
{
  "source_row_filters": {
    "APCM": {
      "any": [
        { "field": "告警类型", "in": ["MMSG告警"] },
        { "field": "告警等级", "in": ["中高", "高"] }
      ]
    }
  }
}
The filter is keyed by source label. A source without a matching filter passes through unchanged. any keeps a row when at least one clause matches; all keeps a row only when every clause matches. Each clause can use in or not_in against a field value.
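The semantics above can be made concrete with a small evaluator. This is a sketch of the described behavior, not the product's code. The function names are hypothetical, and two details are assumptions the text does not settle: a filter carrying both any and all must satisfy both, and a row missing the field is treated as having the value None.

```python
def clause_matches(row, clause):
    """Evaluate one clause using `in` or `not_in` against a field value."""
    value = row.get(clause["field"])  # assumption: missing field reads as None
    if "in" in clause:
        return value in clause["in"]
    if "not_in" in clause:
        return value not in clause["not_in"]
    raise ValueError(f"clause has no operator: {clause!r}")

def keep_row(row, source_filter):
    """Return True when the row belongs to the intended source slice."""
    if not source_filter:
        return True  # no filter for this source: pass through unchanged
    ok = True
    if "any" in source_filter:
        ok = ok and any(clause_matches(row, c) for c in source_filter["any"])
    if "all" in source_filter:
        ok = ok and all(clause_matches(row, c) for c in source_filter["all"])
    return ok

def apply_source_row_filters(rows_by_source, source_row_filters):
    """Filter each source by the entry keyed with its source label."""
    return {
        label: [r for r in rows if keep_row(r, source_row_filters.get(label))]
        for label, rows in rows_by_source.items()
    }

config = {
    "source_row_filters": {
        "APCM": {
            "any": [
                {"field": "告警类型", "in": ["MMSG告警"]},
                {"field": "告警等级", "in": ["中高", "高"]},
            ]
        }
    }
}
# Hypothetical sample rows; "CMMS" is an invented second source label.
rows_by_source = {
    "APCM": [
        {"告警类型": "MMSG告警", "告警等级": "低"},  # kept: first clause matches
        {"告警类型": "其他", "告警等级": "高"},      # kept: second clause matches
        {"告警类型": "其他", "告警等级": "低"},      # dropped: no clause matches
    ],
    "CMMS": [{"work_order": 1}],                     # no filter: unchanged
}
filtered = apply_source_row_filters(rows_by_source, config["source_row_filters"])
print(len(filtered["APCM"]), len(filtered["CMMS"]))  # 2 1
```

Comparing row counts before and after this step is the same before-and-after sampling recommended for governed filters.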
Treat source row filters as governed task configuration:
- document the business rule behind each filter;
- sample the source rows before and after filtering;
- rerun baseline totals before enabling the filter on a scheduled task;
- keep the raw source data available for audit and later review.
Deployments may keep source_row_filters disabled until the data owner approves the scope and capacity settings. If the environment setting is not enabled, the filter setting is ignored during dispatch.
Published rulesets and conflict fields
For methods backed by a published DFS ruleset, review the live ruleset before changing an operating task. The ruleset is the operational source for field extraction, matching rules, survivorship rules, confidence weights, and any AI-assist threshold used by the workflow.
Choose conflict fields that reflect business disagreement. Structured fields such as governed identity, asset class, operating status, severity, batch context, equipment state, timestamp bucket, or maintained object usually make better conflict signals than free-text messages or source-specific codes. Keep verbose message text and raw source codes in the evidence record so reviewers can audit the decision without inflating the conflict count.
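One way to picture the split between conflict signals and evidence is the sketch below. It is illustrative only: `compare_records` and the field names are hypothetical, and a published ruleset would define the real conflict fields.

```python
def compare_records(left, right, conflict_fields):
    """Count disagreement only on conflict fields; keep the rest as evidence."""
    conflicts = [f for f in conflict_fields if left.get(f) != right.get(f)]
    # Differences in other fields are retained for audit, not counted.
    evidence = {
        f: (left.get(f), right.get(f))
        for f in sorted(set(left) | set(right))
        if f not in conflict_fields and left.get(f) != right.get(f)
    }
    return conflicts, evidence

# Hypothetical records from two sources describing the same event.
left = {"asset_class": "pump", "severity": "high", "message": "Bearing temp hi"}
right = {"asset_class": "pump", "severity": "medium", "message": "bearing temperature high"}

conflicts, evidence = compare_records(left, right, ["asset_class", "severity"])
print(conflicts)  # ['severity']
print(evidence)   # {'message': ('Bearing temp hi', 'bearing temperature high')}
```

The differing message text stays visible to reviewers, but only the severity disagreement adds to the conflict count.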
Async execution and recovery
Fusion runs are dispatched in the background. For large streaming runs, DFS Pro accepts the job and checks for the result after dispatch, keeping the user action responsive.
If a service restart or dependency failure leaves an old run in RUNNING, the scheduler can mark the stale run as failed and unblock the task for retry. Operators should use run history to confirm the failure reason, then retry after the source, method, or capacity issue has been addressed.
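The stale-run recovery step can be sketched as a sweep over run records. The scheduler's actual rules are not described here, so the function name `mark_stale_runs`, the record fields, the FAILED status string, and the six-hour age limit are all assumptions.

```python
from datetime import datetime, timedelta

def mark_stale_runs(runs, now, max_age=timedelta(hours=6)):
    """Mark RUNNING runs older than max_age as FAILED and return their IDs."""
    recovered = []
    for run in runs:
        if run["status"] == "RUNNING" and now - run["started_at"] > max_age:
            run["status"] = "FAILED"
            # Record a reason so operators can see it in run history.
            run["error_message"] = "Marked stale after exceeding max run age"
            recovered.append(run["id"])
    return recovered

# Hypothetical run records.
now = datetime(2024, 1, 2, 12, 0)
runs = [
    {"id": "r1", "status": "RUNNING", "started_at": datetime(2024, 1, 1, 12, 0)},
    {"id": "r2", "status": "RUNNING", "started_at": datetime(2024, 1, 2, 11, 0)},
    {"id": "r3", "status": "COMPLETED", "started_at": datetime(2024, 1, 1, 9, 0)},
]
print(mark_stale_runs(runs, now))  # ['r1']
```

Only the day-old run is failed. The one-hour-old run is still within the limit, and completed runs are never touched.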
Create a fusion task
- Open Data Fusion.
- Select Create Fusion Task.
- Enter a task name.
- Add a description.
- Choose a fusion mode.
- Select input datasets.
- Select a method when the task requires reusable processing logic.
- Enter an output dataset name or select an existing output dataset.
- Configure the conflict threshold when available.
- Save the task.
Use a name that describes the business output.
Examples:
- Asset sensor and work order alignment
- Inspection finding to maintenance record match
- Equipment alias reconciliation
- Predictive maintenance signal feature merge
Run the task
Use Run from the fusion task list or detail page.
During execution, a task may move through statuses such as queued, running, completed, failed, cancelled, or review.
After starting a task:
- Watch status.
- Open run history.
- Review total, matched, and conflict counts.
- Open review queue if status indicates review.
- Use output dataset only after required review is complete.
Review run history
Run history helps users understand what happened during execution.
Check:
- triggered by;
- started at;
- duration;
- total records;
- matched records;
- conflict records;
- error message when failed.
If a task fails, fix the dataset, method, mapping, or output dataset issue before retrying.
Review uncertain output
Fusion can produce conflicts, source disagreements, low-confidence matches, or manual flags. These should go through the review queue.
Reviewers should compare:
- input dataset records;
- matching keys;
- source timestamps;
- confidence;
- conflict reason;
- output record;
- downstream impact.
Retry or cancel
Use retry after fixing a failed task. Use cancel when a queued or running task should stop because the input or configuration is wrong.
Before retry:
- confirm input datasets exist and are accessible;
- confirm output dataset is writable;
- confirm method status is usable;
- confirm any source_row_filters still match the intended source labels and field names;
- check the last error message;
- check whether review items remain open.
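The check that source_row_filters still match the source labels and field names can be automated before a retry. The sketch below is a hypothetical helper, not a DFS Pro API: `check_filter_config` and the `source_fields` mapping (source label to its current field names) are assumptions.

```python
def check_filter_config(source_row_filters, source_fields):
    """Return a list of problems; an empty list means the filters line up."""
    problems = []
    for label, source_filter in source_row_filters.items():
        if label not in source_fields:
            problems.append(f"unknown source label: {label}")
            continue
        for mode in ("any", "all"):
            for clause in source_filter.get(mode, []):
                if clause["field"] not in source_fields[label]:
                    problems.append(f"{label}: unknown field {clause['field']}")
    return problems

# Hypothetical current schema per source label.
source_fields = {"APCM": {"告警类型", "告警等级", "设备"}}

filters = {
    "APCM": {"any": [{"field": "告警类型", "in": ["MMSG告警"]}]},
    "APCM_OLD": {"all": [{"field": "告警等级", "in": ["高"]}]},
}
print(check_filter_config(filters, source_fields))
# ['unknown source label: APCM_OLD']
```

A filter keyed by a label that no longer exists would otherwise pass silently, since a source without a matching filter passes through unchanged.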
Output dataset
A completed fusion task can produce an output dataset. Treat that output as governed data:
- preview rows;
- profile columns;
- assign a steward if it will be reused;
- validate the dataset after review;
- check lineage before replacing or deprecating it.
When governed identity is part of a fusion output, include the stable MDM entity ID in the output records. For example, a reliability workflow can resolve a normalized registration, tag, serial number, or maintainable-object ID through the MDM alias ledger and write the resulting entity ID beside the fused event or reliability record. Unresolved or ambiguous rows should remain visible as exceptions for steward review.
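As a rough sketch of that resolution step, the example below looks up a normalized identifier in an alias mapping and writes the entity ID beside each record. The alias ledger is modeled as a plain dictionary, and `normalize`, `resolve_entity`, `attach_entity_ids`, and the `mdm_entity_id` and `resolution` fields are hypothetical names. The real MDM alias ledger and its normalization rules may differ.

```python
def normalize(identifier):
    """Uppercase and strip separators so aliases compare consistently."""
    return "".join(ch for ch in identifier.upper() if ch.isalnum())

def resolve_entity(identifier, alias_ledger):
    """Return (entity_id, status) for one identifier."""
    candidates = alias_ledger.get(normalize(identifier), set())
    if len(candidates) == 1:
        return next(iter(candidates)), "resolved"
    if not candidates:
        return None, "unresolved"
    return None, "ambiguous"  # more than one entity claims this alias

def attach_entity_ids(records, alias_ledger, id_field="tag"):
    """Split records into fused output and exceptions for steward review."""
    fused, exceptions = [], []
    for rec in records:
        entity_id, status = resolve_entity(rec[id_field], alias_ledger)
        out = {**rec, "mdm_entity_id": entity_id, "resolution": status}
        (fused if status == "resolved" else exceptions).append(out)
    return fused, exceptions

# Hypothetical ledger: normalized alias -> set of MDM entity IDs.
alias_ledger = {"PUMP01": {"ENT-001"}, "P17": {"ENT-002", "ENT-003"}}
records = [{"tag": "pump-01"}, {"tag": "P-17"}, {"tag": "X9"}]

fused, exceptions = attach_entity_ids(records, alias_ledger)
print([r["mdm_entity_id"] for r in fused])    # ['ENT-001']
print([r["resolution"] for r in exceptions])  # ['ambiguous', 'unresolved']
```

Ambiguous and unresolved rows are returned rather than dropped, so they stay visible as exceptions.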
Next step
Use Review Queue to resolve fusion conflicts and rejected rows.