The "WARTS" application is a database utility that goes through all of the columns in one or more tables in one or more database schemas generating a report on those columns that contain "high bits" in character data. A high bit is a byte that has the 8th bit set, and this generally indicates the presence of characters in a non-ascii encoding such as UTF-8 or ISO-8859-1. The application scans data in columns of type CHAR, VARCHAR, and CLOB.
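The high-bit test itself is simple. The following sketch (illustrative class and method names, not WARTS's actual API) shows the check, together with the two-byte UTF-8 lead/continuation pattern that underlies the Latin-1/UTF-8 ambiguity discussed later:

```java
// Sketch of high-bit detection and a rough two-byte UTF-8 check.
// Class and method names are illustrative, not WARTS's actual API.
public class HighBitScan {

    // True if the byte has its 8th (high) bit set, i.e. it is non-ascii.
    public static boolean isHighBit(byte b) {
        return (b & 0x80) != 0;
    }

    // True if the two bytes form a valid 2-byte UTF-8 sequence:
    // a lead byte 110xxxxx followed by a continuation byte 10xxxxxx.
    public static boolean isTwoByteUtf8(byte b1, byte b2) {
        return (b1 & 0xE0) == 0xC0 && (b2 & 0xC0) == 0x80;
    }
}
```

Note that a valid two-byte UTF-8 sequence such as 0xC3 0xA9 ('é') is also two printable Latin-1 characters, which is exactly why some characters can only be reported as ambiguous.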
"WARTS" is an acronym for Warren's Relational ToolSet. This is an awful acronym, and suggestions for a better one would definitely be appreciated. As the expanded name suggests, I intend to add useful features other than the non-ascii character scanner to the application as needed.
On the public website, and on the systems of users after the 'javadoc' build target has been run, the API documentation for the classes in the org.afcs.warts
package is available here.
The application can be run using one of two JAR files (warts_cl.jar
or warts_gui.jar
) available in the distribution and version 9.1.0.4.0 of Oracle's JDBC drivers. In order to run the application, you will need a command line and a recent version of Java which is available at the command line. Any version of Java from the 1.3 or 1.4 series should work (you can always test what version of Java you have by typing "java -version
"). For more detail on why version 9.1.0.4.0 of Oracle's JDBC drivers is preferred, see the discussion in Section 5.1.
The CVS repository contains a script that can be used to build a single JAR file (simply warts.jar
) containing the contents of warts_gui.jar
and whatever database drivers you happen to be using. This simplifies distribution and use of the application.
Currently, because of Oracle's refusal to properly implement the parts of the JDBC 3.0 API that allow for the setting of CLOB column values (version 9.2.0.3 of their OJDBC drivers implements the methods but calling them throws an exception with an 'Unsupported feature' message), the synchronization tool has to deal directly with classes from the oracle.sql
package. This currently limits WARTS to dealing with Oracle databases using the Oracle drivers. Note that you can't use the latest versions of Oracle's JDBC drivers with Java 1.3.
The application can be run in both command line and GUI modes. The command line performs analysis and report generation for a set of tables specified in a configuration file. The GUI version allows interactive examination of non-ascii characters in tables in addition to report generation facilities. Database connections and the list of tables/schemas to be examined are configured for both modes via configuration files specified on the command line.
The command line version of WARTS generates a report to stdout, and this can be redirected or piped using the normal shell operators. If a fatal error occurs, this will be printed to stderr and the application will exit with a code of 1. The usage required to start WARTS in command line mode is:
java -jar warts_cl.jar [ -debug ] ( -conf account-config-file | -desc account-config-file | -nacreport account-config-file | -syncreport account-config-file account-config-file )*
or alternatively:
java -cp warts.jar org.afcs.warts.cl.Main [ -debug ] ( -conf account-config-file | -desc account-config-file | -nacreport account-config-file | -syncreport account-config-file account-config-file )*
Each "account configuration file" tells the application what database accounts to use to log in, and what tables to look at.
The various flags and commands have the following meanings:
-debug - print debugging information during execution.

-conf - loads the account configuration file and prints it to stdout. This can be used to test that the configuration file is loaded correctly.

-desc - loads the account configuration file, loads the metadata for each of the tables referenced in all of the accounts referenced, and prints the metadata to stdout (in a DESC-like format). This can be used to test that all of the tables referenced in the configuration file could be accessed and analysed correctly.

-nacreport - loads the account configuration file, and then runs through each of the tables referenced in all of the accounts referenced, generating a non-ascii character report.

-syncreport - loads the two account configuration files, and runs through each of the tables available, generating a report on what would be required to synchronize the data in the table in the second configuration file with the data in the first. Each account configuration file must define the same number of accounts, and, for each account, the same number of tables.

The GUI version of WARTS allows the interactive examination of the results of non-ascii character scanning of all data in a table, right down to the cell level. This can be used to verify that the data in a table follows the expected patterns prior to manipulation of that data. The usage required to start WARTS in GUI mode is:
java -jar warts_gui.jar [ -debug ] account-config-file [ account-config-file ... ] [ -compare ]
or alternatively:
java -cp warts.jar org.afcs.warts.gui.Main [ -debug ] account-config-file [ account-config-file ... ]
As many account configuration files as are needed may be specified on the command line and the accounts from all of them will be available in the GUI. As for the command line version, a -debug
option can be specified to cause debugging information to be printed to stdout.
The -compare
option causes a data-comparison tab to be opened when the application boots. This can slow the application down because it may be forced to load a lot of table metadata when it attempts to find an equivalent table for the first table in the first account.
The account configuration file consists of a set of name=value pairs, where each pair may contain whitespace before and/or after the name and/or value. Lines consisting of whitespace, or starting with # are ignored (the latter mechanism is useful for including comments or temporarily removing an account from consideration). If you hadn't already realised, this is simply your typical Java properties file.
The first required parameter is "numAccounts
", which defines the number of accounts that the application will be logging in using during its analysis. For each account (with numbering starting from 1), the configuration file must define "databaseX
", "usernameX
", "passwordX
", and "tablesX
" parameters, where X
is the number in question.
The "databaseX
" parameter should contain the URI of the database to connect to in the format expected by the Java driver manager (see Figure 1 for an example). The "usernameX
" and "passwordX
" parameters should specify the username and password with which to establish a connection. (Because a password is stored in plain text in the configuration file, appropriate security precautions should be taken, like restricting access to the file or directory containing the file to just the user running the application.)
# Some text to be ignored.
numDatabases = 1
database1 = jdbc:oracle:thin:@HOSTNAME:PORT:SID
username1 = USERNAME
password1 = PASSWORD
tables1 = *
#tables1 = users orders
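Since the configuration is a standard Java properties file, loading it amounts to a call to java.util.Properties.load(), which itself skips blank lines and lines starting with #. A minimal sketch (not the actual WARTS loader) of reading such a file:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

// Minimal sketch of loading an account configuration file; the real
// loader in org.afcs.warts may differ. Comment lines and blank lines
// are skipped by Properties.load() itself.
public class ConfigSketch {

    public static Properties load(String path) throws IOException {
        FileInputStream in = new FileInputStream(path);
        try {
            Properties props = new Properties();
            props.load(in);
            return props;
        } finally {
            in.close();
        }
    }
}
```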
The "tablesX
" parameter should define a whitespace-separated list of tables to analyse from the current account. If tablenames are specified explicitly, then these tablenames should refer to tables in the current account or to synonyms that refer to tables in another account. The wildcard character "*
" can be used to analyse all tables in the current account. The tables are found using the query "SELECT table_name FROM all_tables WHERE owner = usernameX
". The wildcard character prefixed by another account name (e.g., "ACCOUNT.*
") can be used to analyse all of the tables in another account. The tablenames are found by inserting the other account name in the aforementioned query.
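The wildcard handling can be sketched as follows (a hypothetical helper, not the actual WARTS code): given one whitespace-separated token from a tablesX value, work out which account owner, if any, the ALL_TABLES query should be run against.

```java
// Hypothetical sketch of tablesX wildcard handling, not the actual
// WARTS code. Given one whitespace-separated token, returns the account
// whose tables should be listed via ALL_TABLES, or null for an explicit
// table or synonym name.
public class TablesSpec {

    public static final String LIST_QUERY =
        "SELECT table_name FROM all_tables WHERE owner = ?";

    public static String wildcardOwner(String token, String currentAccount) {
        if (token.equals("*")) {
            return currentAccount;                         // all tables here
        }
        if (token.endsWith(".*") && token.length() > 2) {
            return token.substring(0, token.length() - 2); // "ACCOUNT.*"
        }
        return null;                                       // explicit name
    }
}
```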
Optionally, the configuration file can define a "dbDriver
" parameter, in which case the application will instruct the driver manager to use the specified class as a database driver. This is useful, for instance, to find out what SQL the application is generating by using a debugging driver that prints SQL to the command line.
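Honouring such a parameter typically amounts to no more than loading the named class, since JDBC drivers register themselves with java.sql.DriverManager from a static initialiser. A minimal sketch (illustrative, not the actual WARTS code):

```java
// Minimal sketch of honouring a dbDriver parameter. Loading the class is
// enough: JDBC drivers register themselves with java.sql.DriverManager
// in a static initialiser when the class is first loaded.
public class DriverLoader {

    public static void loadDriver(String className) throws ClassNotFoundException {
        Class.forName(className);
    }
}
```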
Both the command line and GUI versions of "WARTS" can be used to generate reports containing a list of the columns found that contain bytes with high bits and some diagnostics about the number of high bits found. An example of the output is given in Figure 2, where whitespace and some of the DESC-style columns have been removed to fit the output on paper.
Report for mol3.person (28755 rows)

             |      Characters       |         Rows
Column Name  |Lat-1|2B-U8|3B-U8|Ambig|Lat-1|2B-U8|3B-U8|Ambig|Oversized
-------------|-----|-----|-----|-----|-----|-----|-----|-----|---------
family_name  |   0 |  18 |  18 |  33 |   0 |  16 |   0 |  31 |    0
given_name   |   0 |  14 |  14 |  21 |   0 |  12 |   0 |  20 |    0
institution  |   0 |  25 |  25 |  46 |   0 |  20 |   0 |  39 |    0
department   |   0 |  15 |  15 |  43 |   0 |  10 |   0 |  32 |    0
street       |   0 |  70 |  70 | 127 |   0 |  49 |   0 |  73 |    0
city         |   0 |  28 |  28 |  35 |   0 |  22 |   0 |  32 |    0
state        |   0 |   1 |   1 |   6 |   0 |   1 |   0 |   6 |    0
username     |   0 |   1 |   1 |   3 |   0 |   1 |   0 |   3 |    0
For each column that contains non-ascii characters, the following figures are given:
Lat-1 Chrs - the number of Latin-1 characters found.

2B-U8 Chrs - the number of 2-byte UTF-8 characters found.

3B-U8 Chrs - the number of 3-byte UTF-8 characters found.

Ambig Chrs - the number of ambiguous characters found. An ambiguous character is probably a single UTF-8 character, but might be 2 Latin-1 characters.

Lat-1 Rows - the number of rows (or, more precisely, cells) containing Latin-1 characters.

2B-U8 Rows - the number of rows containing 2-byte UTF-8 characters.

3B-U8 Rows - the number of rows containing 3-byte UTF-8 characters.

Ambig Rows - the number of rows containing ambiguous characters.

Oversized - the number of cells currently containing Latin-1 characters that would overflow the column size if converted in place to UTF-8.

Numerous bugs in Oracle's JDBC drivers have caused several undesirable workarounds to be coded into WARTS. This section documents those workarounds in the hope that they can be removed at some time in the future, when the bugs are resolved. All of these bugs have been seen and demonstrated using version 9.2.0.3.0 of Oracle's standalone JDBC drivers for JDK 1.4 (a file called ojdbc.jar). I should also mention that I was testing using JDK 1.4.1_04 on Red Hat Linux 9, connecting to Oracle 9.2.0.2.0, and that I'm only considering the "thin" driver here (I haven't bothered with the OCI driver).
*LOB loading in 9.2.0.3.0
The most serious problem I've come across is the unpredictable crashing of some of Oracle's more recent JDBC drivers when loading LOB data from a UTF-8 database. The bug appeared briefly during our buildup to the migration from an ASCII to a UTF-8 database, and can't be reliably replicated. It seems to depend on the database configuration, possibly occurring when byte length semantics are used in a database that has character length semantics as the default. In particular, 9.2 series of Oracle's JDBC drivers (I've tested 9.2.0.1.0 and 9.2.0.3.0) seem to die after loading a certain number of *LOB
fields, throwing a "Bigger type length than Maximum" exception. The error occurs at a reasonably consistent point and is independent of whether the LOBs are loaded from a single ResultSet or several. Continuing to load data after the exception has been triggered eventually produces something worse than a SQLException, such as a NullPointerException, which can crash the application.
If you see this problem, you should hopefully be able to use some older drivers that don't experience the problem. One set of drivers that I've found to work are version 9.1.0.4.0 of the Enterprise drivers (classes12dms.jar and dms.jar). These ship with version 9.0.3.0.0 of the OC4J server. Unfortunately, you will no doubt quickly find that these drivers take approximately three times as long as the 9.2 drivers to load LOB data.
LONG and *LOB values can't be read in one query
As reported in this post to Oracle's JDBC forum, it is impossible to load a column value of type LONG
after a value of type *LOB
has been read from the same row of the ResultSet. As a workaround, columns of type LONG are simply not handled at present.
As reported in this post to Oracle's JDBC forum, Oracle's JDBC drivers do not correctly report the length of CHAR
and VARCHAR
columns that use character length semantics (in a UTF-8 database). The JDBC drivers return the number of bytes allocated, which is (usually) four times the number of characters in the column. To make matters worse, the maximum size of a CHAR
or VARCHAR
column is 4000 bytes, so any column sized over 1000 characters will be reported as having a size of 4000.
Some undesirable workarounds were implemented to deal with these bugs. First, the org.afcs.warts.db.AccountConfiguration
class divides all character column sizes by 4 when it detects that it is using the Oracle drivers to connect to a UTF-8 database. This is obviously going to return incorrect results for any columns that have byte length semantics, for any columns with a size of over 1000 characters, or for any fixed Oracle JDBC drivers (at that point, we'll have to start checking version numbers).
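The first workaround amounts to something like the following (an illustrative sketch, not the actual AccountConfiguration code), with exactly the failure modes just noted:

```java
// Illustrative sketch of the divide-by-four workaround, not the actual
// AccountConfiguration code. reportedSize is the byte-allocated size the
// Oracle driver returns for a character column in a UTF-8 database.
public class ColumnSizeFix {

    public static int characterSize(int reportedSize, boolean oracleUtf8) {
        if (oracleUtf8) {
            // Wrong for byte-length-semantics columns and for columns
            // over 1000 characters, as discussed in the text.
            return reportedSize / 4;
        }
        return reportedSize;
    }
}
```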
Secondly, because the column sizes are so unreliable, we can no longer use them when comparing column definitions, or checking whether two tables can be compared. Changes were made to the equals()
and hashCode()
methods in org.afcs.warts.db.ColumnDescription
and to the documentation of the isEquivalentTo()
method in org.afcs.warts.db.TableDescription
.
The alternative to the above workarounds would be to use an Oracle-metadata-specific SELECT
query to obtain accurate column sizes. This may yet be implemented if the workarounds discussed above prove insufficient. However, it will be resisted to the bitter end.
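For reference, such a query would most likely be built on the ALL_TAB_COLUMNS dictionary view, whose CHAR_LENGTH and CHAR_USED columns (present from Oracle 9i onwards) report the size in characters and the length semantics in use. A sketch, not anything currently implemented in WARTS:

```java
// Sketch of the Oracle-specific dictionary lookup discussed above; the
// ALL_TAB_COLUMNS view's CHAR_LENGTH and CHAR_USED columns (Oracle 9i
// onwards) give the size in characters and the length semantics.
public class OracleColumnSizes {

    public static final String SIZE_QUERY =
        "SELECT column_name, char_length, char_used " +
        "FROM all_tab_columns WHERE owner = ? AND table_name = ?";

    // CHAR_USED is 'C' for character length semantics, 'B' for byte.
    public static boolean usesCharacterSemantics(String charUsed) {
        return "C".equals(charUsed);
    }
}
```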
As reported in this post to Oracle's JDBC forum, Oracle's JDBC drivers return the wrong value when using ResultSet.getString()
to obtain a value from a column of type CHAR
(VARCHAR
is fine). The value returned is padded with enough space characters (0x20) to fill the number of bytes that would be obtained by querying the column size via the metadata (as discussed above).
As a workaround, the loadValue()
method in org.afcs.warts.handlers.CharacterHandler
truncates the strings received from all columns of type CHAR
in a UTF-8 database, where the number of bytes returned by
ResultSet.getBytes()
is equal to four times the column size (as discussed above, the column size is obtained by dividing the reported column size by four). This will not work for columns of type CHAR
with a size of over 1000, because the size cannot be accurately determined, and so column values will always be truncated to 1000.
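The truncation logic can be sketched roughly as follows (illustrative, not the actual loadValue() implementation):

```java
// Rough sketch of the CHAR padding workaround, not the actual loadValue()
// implementation. columnSize is the corrected size (reported size / 4);
// byteLength is what ResultSet.getBytes() returned for the cell.
public class CharPadFix {

    public static String stripPadding(String value, int byteLength, int columnSize) {
        if (byteLength == 4 * columnSize && value.length() > columnSize) {
            return value.substring(0, columnSize); // cut off spurious padding
        }
        return value;
    }
}
```

Note that a CHAR value is space-padded to the column size anyway, so truncating to columnSize preserves the normal fixed-width behaviour.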
As reported in this post to Oracle's JDBC forum, the column types returned by Oracle's implementation of DatabaseMetaData cannot be used to bind variables in PreparedStatement.setObject(). This means that each org.afcs.warts.handlers.DataHandler implementation in WARTS has to provide an implementation of bindNonNullValue(). This is a minor inconvenience, and probably more efficient in any case.
The most likely cause of error at present is the application encountering a datatype in the database that it does not understand. The application should currently be able to process all character, numeric, raw, and lob column types. Unexpected types may still lurk in our schemas; if you find one, please send me the actual type of the column (as reported by DESC) and the integer column type index printed in the error message.
If the tables you are processing are particularly large, then you may well run out of memory as they are loaded. The application caches all data in primary key columns as well as any columns containing character data (including CLOB columns), and this can add up. The amount of memory allocated to Java can be increased with arguments such as "-Xms128m -Xmx384m" (substituting the necessary amounts of memory). Type "java -X" for the documentation of these non-standard options.