|
Return to Newsletter Contents...
Data Integrator Optimization Tuning Techniques Introduction
by: Tom Hinkle, Consultant,
CRCP, Data Integrator Instructor
Source and
Target Based Tuning techniques
Source-based performance options
Using
array fetch size
Data Integrator provides an easy way to request and
obtain multiple data rows from source databases. The array fetch size feature
allows you to retrieve data using fewer requests, thereby significantly reducing
traffic on your network. Tuning array fetch size can also reduce the amount of
CPU use on the Job Server computer.
Array Fetch Size
indicates the number of rows returned in a single fetch call to a source table.
The default value is 1000 and the maximum value is 5000. This value reduces the
number of round-trips to the database and can improve performance for table
reads.
Caching data
You can improve the performance of data transformations
that occur in memory by caching as much data as possible. By caching data, you
limit the number of times the system must access the database. Cached data must
fit into available memory. Select Yes in a source table or file editor to
specify that data from the source is cached using memory on the Job Server
computer.
Caching Tables
Business Objects recommends that you apply table
caching for small tables (typically with a table size of less than
1% of total physical memory). You
can specify the number of rows in a table to override the default value Data
Integrator uses during execution. To specify a table size, set the Estimated
row count attribute for the table. The default row count value is 50,000.
Caching joins
Cache a source only if it is being used as
an inner join. (Inner joins have a lower join rank than outer joins). Caching
does not affect the order in which tables are joined. If optimization conditions
are such that Data Integrator is pushing down operations to the underlaying
database, it ignores your cache setting. If a table becomes too large to fit in
the cache, disable caching for the table.
Caching lookups
You can also improve performance by caching
data when looking up
individual values from tables and files.
There are two methods of looking up data:
Using a Lookup function in a query
Data Integrator has three Lookup functions:
lookup,
lookup_seq,
and
lookup_ext. The
lookup
and
lookup_ext
functions have cache options. Caching lookup tables improves performance
because Data Integrator avoids the expensive task of creating a database query
on each row.
Using a source table and setting it as the outer join
Although you can use lookup functions inside Data
Integrator queries, an alternative is to expose the translate (lookup) table as
a source table in the data flow diagram, and use an outer join (if necessary) in
the query to look up the required data.
Caching table comparisons
You can improve the performance of a
Table_Comparison transform by caching the comparison table.
There are three modes of comparisons:
Row-by-row select
Cached comparison table
Sorted input
Of the three, Row-by-row select will
likely be the slowest and Sorted input the fastest.
Join
ordering
Join ordering is determined by the type of
join you choose to use in Data Integrator. There are two types of joins:
Normal
— Created by drawing connection lines from multiple sources in a data flow to
the same query and then entering an integer as a Join rank value for each
source using source editors. Data Integrator orders join operations using these
values.
Outer Join — Created by taking a normal join and then using the
Outer Join tab of the query to specify the outer and inner sources. Data
Integrator processes the outer source as if it was assigned the highest join
rank.
For a join between two tables, assign a
higher join rank value to the larger
table and, if possible, cache the smaller table. For a join between a table and
file:
o
If the file is small and can be cached, then assign it a lower join rank
value and cache it.
o
If you cannot cache the file, then assign
it a higher join rank value so that it becomes the “outer table” in the join.
For a join between two files, assign a
higher rank value to the larger file and, if possible, cache the smaller file.
Minimizing extracted data
The best way to minimize the amount of data extracted
from the source systems is to retrieve only the data that has changed since the
last time you performed the extraction. This technique is called changed-data
capture, and it is described in detail in the Data Integrator Designer
Guide, Chapter 19: Techniques for Capturing Changed Data .
Target-based performance options
Loading method and rows per commit
You can choose to use regular loading or bulk loading with Data Integrator.
To improve performance, you can select the
following options from the target table editor. Note that if you use one, you
cannot use the others for the same target.
•
Enable Partitioning
Parallel loading option. The number of parallel loads
is determined by the number of partitions in the target table.
•
Number of Loaders
Parallel loading option. The number of parallel loads
is determined by the number you enter for this option.
•
Transactional loading
This option allows you to commit data to multiple
tables as part of the same transaction.
Rows per commit
for regular loading defaults to 1000 rows. Setting the Rows per commit
value significantly affects job performance. Adjust the rows per commit value in
the target table editor’s Options tab, noting the following rules:
•
Do not use
negative number signs and other non-numeric characters.
•
If you enter
nothing or
0,
the text box will automatically display
1000.
•
If you enter a
number larger than
5000,
the text box automatically displays
5000, which is the
maximum value allowed.
Staging tables to speed up auto-correct loads
For large loads where auto-correct is
required:
• For Oracle targets, create two data flows:
• In the first data flow, extract data and load it to a
staging table using a temporary table object that resides in the same datastore
as your target table.
• In the second data flow, use the staging table as the
source and an Oracle database table as the target. In the target table editor,
select Auto-correct load. This entire statement will be pushed down to
the database, maximizing performance.
• For other targets, load data into a staging table and
use post-load SQL, a script, or a data flow to:
• Delete rows from the target where the key is in the
staging table.
• Insert rows from the staging table into the target.
Additional information can be found in the
Data
Integrator Performance Optimization Guide
Go to Top |
Return to Newsletter Contents
|