org.afcs.warts.db
Class DataHighBitAnalysis

java.lang.Object
  extended by org.afcs.warts.db.DataHighBitAnalysis
All Implemented Interfaces:
java.lang.Comparable

public final class DataHighBitAnalysis
extends java.lang.Object
implements java.lang.Comparable

The DataHighBitAnalysis class performs "high-bit" analysis on an array of bytes, classifying each byte as ASCII, Latin-1, UTF-8 or ambiguous, and classifying the string as a whole.

One of the tricky parts of character classification is telling the difference between bytes that make up a single multibyte UTF-8 character and bytes that make up several Latin-1 characters. Such bytes are currently classified as ambiguous, with the exception of certain Eastern European characters (lower-case vowels with umlauts etc.), where the probability of them being part of a 2 byte UTF-8 character is much higher than that of them being 2 Latin-1 characters (which would typically look something like 'Ã¼').
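
The ambiguity can be reproduced with the standard library alone. The following is a minimal sketch, not part of this class: the class name AmbiguousBytesDemo and its two helper methods are hypothetical. It decodes the same two bytes (0xC3 0xBC) once as UTF-8 and once as Latin-1.

```java
import java.nio.charset.StandardCharsets;

final class AmbiguousBytesDemo {
    /** Decodes the bytes as UTF-8, where several bytes may form one character. */
    static String asUtf8(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    /** Decodes the bytes as Latin-1, where every byte is exactly one character. */
    static String asLatin1(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] data = { (byte) 0xC3, (byte) 0xBC };
        // Lengths are printed rather than the text to avoid console encoding issues.
        System.out.println(asUtf8(data).length());   // one character: U+00FC
        System.out.println(asLatin1(data).length()); // two characters: U+00C3, U+00BC
    }
}
```

Both decodings succeed, which is why the bytes alone cannot settle the question and a probability judgement is needed.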

A byte may be classed as illegal when it appears that a combination of bytes could not possibly be transformed into a valid sequence of characters. This most often occurs when a byte in the range 0x80 - 0x9F would have to lead off a character. This range holds no printable Latin-1 characters, so such a byte can be characterised as illegal.
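
The 0x80 - 0x9F rule described above can be sketched as two bit tests. This is an illustration under stated assumptions, not the class's actual implementation: the class name LeadByteCheck and its methods are hypothetical. It relies on two facts: ISO-8859-1 assigns only control codes to 0x80 - 0x9F, and UTF-8 reserves the bit pattern 10xxxxxx (0x80 - 0xBF) for continuation bytes, which can never start a character.

```java
final class LeadByteCheck {
    /** True if b lies in 0x80-0x9F, which holds no printable Latin-1 characters. */
    static boolean isInC1Range(byte b) {
        int v = b & 0xFF; // Java bytes are signed, so mask to get 0-255
        return v >= 0x80 && v <= 0x9F;
    }

    /** True if b has the bit pattern 10xxxxxx of a UTF-8 continuation byte. */
    static boolean isUtf8Continuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    /**
     * True if b cannot lead off a character under either reading:
     * it is not printable Latin-1 and it cannot start a UTF-8 sequence.
     */
    static boolean cannotLeadCharacter(byte b) {
        return isInC1Range(b) && isUtf8Continuation(b);
    }
}
```

Under these assumptions, cannotLeadCharacter((byte) 0x85) is true, while 0xC3 (a UTF-8 lead byte and a printable Latin-1 character) and 0x41 (ASCII) both return false.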

LICENSE: This code is released to the public domain and may be used for any purpose whatsoever without permission or acknowledgment.

Version:
Last Modified 19 September 2003
Author:
Warren Hedley ( whedley at sdsc dot edu )

Field Summary
static byte BYTE_CLASS_ASCII
          The byte was classified as an ASCII byte (0-127).
static byte BYTE_CLASS_FIRST_AMBIGUOUS
          The byte was classified as the first byte of an ambiguous set of bytes (one that might be a single multi-byte UTF-8 character, or several Latin-1 characters).
static byte BYTE_CLASS_ILLEGAL
          The byte was classified as illegal, which can happen when a combination of bytes cannot be transformed into a collection of valid characters.
static byte BYTE_CLASS_LATIN_1
          The byte was classified as a Latin-1 character (128-255).
static byte BYTE_CLASS_NOT_FIRST_AMBIGUOUS
          The byte was classified as one byte (but not the first) of an ambiguous set of bytes (one that might be a single multi-byte UTF-8 character, or several Latin-1 characters).
static byte BYTE_CLASS_THREE_BYTE_UTF_8
          The byte was classified as part of a 3 byte UTF-8 character.
static byte BYTE_CLASS_TWO_BYTE_UTF_8
          The byte was classified as part of a 2 byte UTF-8 character.
static int DATA_CLASS_ASCII
          The string consists of nothing but ASCII characters.
static int DATA_CLASS_ASCII_AND_2_BYTE_UTF_8
          The string consists of a mix of ASCII and 2 byte UTF-8 characters.
static int DATA_CLASS_ASCII_AND_3_BYTE_UTF_8
          The string consists of a mix of ASCII and 3 byte UTF-8 characters.
static int DATA_CLASS_ASCII_AND_LATIN_1
          The string consists of a mix of ASCII and Latin-1 characters.
static int DATA_CLASS_CONTAINS_AMBIGUOUS
          The string consists of a mix of ASCII and "ambiguous" characters.
static int DATA_CLASS_CONTAINS_ILLEGAL
          The string contains illegal bytes.
static int DATA_CLASS_CONTAINS_MULTIPLE
          The string consists of a mix of ASCII and multiple classes of non-ASCII characters.
 
Constructor Summary
DataHighBitAnalysis(byte[] data, int numBytesAllowed)
          Constructs a new instance with the specified data, and the size of the column that the data is in.
 
Method Summary
 int compareTo(java.lang.Object otherObj)
          Compares this instance to another object, returning an integer that can be used to sort an array of DataHighBitAnalysis instances based on a case-insensitive comparison of the string returned by getString().
 boolean equals(java.lang.Object otherObj)
          Returns true if the specified object is a DataHighBitAnalysis instance with the same value (what is returned by getString()) as the current instance.
 byte[] getClassifications()
          Returns the array of classifications for each byte.
 byte[] getData()
          Returns the original byte array that was analysed.
 int getDataClass()
          Returns a classification for the byte array as a whole.
 int getNum2ByteUtf8Chars()
          Returns the number of 2 byte UTF-8 characters found during analysis.
 int getNum3ByteUtf8Chars()
          Returns the number of 3 byte UTF-8 characters found during analysis.
 int getNumAmbiguousBytes()
          Returns the number of ambiguous bytes found during analysis.
 int getNumIllegalBytes()
          Returns the number of illegal bytes found during analysis.
 int getNumLatin1Chars()
          Returns the number of Latin-1 characters found during analysis.
 java.lang.String getString()
          Returns a string representation of the byte data using the preferred encoding.
 java.lang.String getStringAsLatin1()
          Returns a string representation of the byte data where the encoding is assumed to be Latin-1.
 java.lang.String getStringAsUtf16()
          Returns a string representation of the byte data where the encoding is assumed to be UTF-16 (equivalent to UCS-2 for characters in the Basic Multilingual Plane).
 java.lang.String getStringAsUtf8()
          Returns a string representation of the byte data where the encoding is assumed to be UTF-8.
 int hashCode()
          Returns a hashcode for the current instance based on the current string value (as returned by getString()).
 java.lang.String toString()
          Returns a text description of the current instance that can be used for debugging purposes.
 boolean utf8Oversize()
          Returns true if the UTF-8 representation of this string would overflow the column it is in (presumably the string is currently encoded using Latin-1).
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

BYTE_CLASS_ASCII

public static final byte BYTE_CLASS_ASCII
The byte was classified as an ASCII byte (0-127).

See Also:
Constant Field Values

BYTE_CLASS_LATIN_1

public static final byte BYTE_CLASS_LATIN_1
The byte was classified as a Latin-1 character (128-255).

See Also:
Constant Field Values

BYTE_CLASS_TWO_BYTE_UTF_8

public static final byte BYTE_CLASS_TWO_BYTE_UTF_8
The byte was classified as part of a 2 byte UTF-8 character.

See Also:
Constant Field Values

BYTE_CLASS_THREE_BYTE_UTF_8

public static final byte BYTE_CLASS_THREE_BYTE_UTF_8
The byte was classified as part of a 3 byte UTF-8 character.

See Also:
Constant Field Values

BYTE_CLASS_FIRST_AMBIGUOUS

public static final byte BYTE_CLASS_FIRST_AMBIGUOUS
The byte was classified as the first byte of an ambiguous set of bytes (one that might be a single multi-byte UTF-8 character, or several Latin-1 characters).

See Also:
Constant Field Values

BYTE_CLASS_NOT_FIRST_AMBIGUOUS

public static final byte BYTE_CLASS_NOT_FIRST_AMBIGUOUS
The byte was classified as one byte (but not the first) of an ambiguous set of bytes (one that might be a single multi-byte UTF-8 character, or several Latin-1 characters).

See Also:
Constant Field Values

BYTE_CLASS_ILLEGAL

public static final byte BYTE_CLASS_ILLEGAL
The byte was classified as illegal, which can happen when a combination of bytes cannot be transformed into a collection of valid characters.

See Also:
Constant Field Values

DATA_CLASS_ASCII

public static final int DATA_CLASS_ASCII
The string consists of nothing but ASCII characters.

See Also:
Constant Field Values

DATA_CLASS_ASCII_AND_LATIN_1

public static final int DATA_CLASS_ASCII_AND_LATIN_1
The string consists of a mix of ASCII and Latin-1 characters.

See Also:
Constant Field Values

DATA_CLASS_ASCII_AND_2_BYTE_UTF_8

public static final int DATA_CLASS_ASCII_AND_2_BYTE_UTF_8
The string consists of a mix of ASCII and 2 byte UTF-8 characters.

See Also:
Constant Field Values

DATA_CLASS_ASCII_AND_3_BYTE_UTF_8

public static final int DATA_CLASS_ASCII_AND_3_BYTE_UTF_8
The string consists of a mix of ASCII and 3 byte UTF-8 characters.

See Also:
Constant Field Values

DATA_CLASS_CONTAINS_AMBIGUOUS

public static final int DATA_CLASS_CONTAINS_AMBIGUOUS
The string consists of a mix of ASCII and "ambiguous" characters.

See Also:
Constant Field Values

DATA_CLASS_CONTAINS_MULTIPLE

public static final int DATA_CLASS_CONTAINS_MULTIPLE
The string consists of a mix of ASCII and multiple classes of non-ASCII characters. This usually reflects a real problem in the data.

See Also:
Constant Field Values

DATA_CLASS_CONTAINS_ILLEGAL

public static final int DATA_CLASS_CONTAINS_ILLEGAL
The string contains illegal bytes. This overrides any other data classes.

See Also:
Constant Field Values
Constructor Detail

DataHighBitAnalysis

public DataHighBitAnalysis(byte[] data,
                           int numBytesAllowed)
Constructs a new instance with the specified data, and the size of the column that the data is in.

Parameters:
data - The bytewise representation of the string to analyse.
numBytesAllowed - The number of bytes allowed in the column from which the data was taken. This determines what utf8Oversize() will return.
Throws:
java.lang.NullPointerException - If data is null.
Method Detail

getData

public byte[] getData()
Returns the original byte array that was analysed. The array returned is the same as the one used internally, so should not be modified by the caller if the reference to the analysis object is shared.

Returns:
The original byte array that was analysed.

getClassifications

public byte[] getClassifications()
Returns the array of classifications for each byte. The array returned will be the same length as the data array specified at initialisation (and returned by getData()), and each byte in the array will be one of the BYTE_CLASS_* constants defined in this class. The array returned is the same as the one used internally, so should not be modified by the caller if the reference to the analysis object is shared.

Returns:
The array of classifications for each byte.

getNumLatin1Chars

public int getNumLatin1Chars()
Returns the number of Latin-1 characters found during analysis.

Returns:
The number of Latin-1 characters found during analysis.

getNum2ByteUtf8Chars

public int getNum2ByteUtf8Chars()
Returns the number of 2 byte UTF-8 characters found during analysis.

Returns:
The number of 2 byte UTF-8 characters found during analysis.

getNum3ByteUtf8Chars

public int getNum3ByteUtf8Chars()
Returns the number of 3 byte UTF-8 characters found during analysis.

Returns:
The number of 3 byte UTF-8 characters found during analysis.

getNumAmbiguousBytes

public int getNumAmbiguousBytes()
Returns the number of ambiguous bytes found during analysis. An ambiguous byte may be part of a single multibyte UTF-8 character or may be one of several Latin-1 characters.

Returns:
The number of ambiguous bytes found during analysis.

getNumIllegalBytes

public int getNumIllegalBytes()
Returns the number of illegal bytes found during analysis. An illegal byte is flagged when a set of bytes couldn't possibly be transformed into a set of valid characters.

Returns:
The number of illegal bytes found during analysis.

getString

public java.lang.String getString()
Returns a string representation of the byte data using the preferred encoding. If the string contains any Latin-1 characters, the preferred encoding is Latin-1; otherwise it is UTF-8.

Returns:
A string representation of the byte data using the preferred encoding for the data.
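
The idea of choosing between the two encodings can be approximated with the standard library. The sketch below is a different, simpler technique than the per-byte analysis this class performs: it tries a strict UTF-8 decode and falls back to Latin-1 when the bytes are not well-formed UTF-8. The class name PreferredEncodingSketch and the method decode are hypothetical, and ambiguous sequences are always read as UTF-8 here, which may differ from what getString() does.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class PreferredEncodingSketch {
    /** Decodes as UTF-8 when the bytes are well-formed UTF-8, else as Latin-1. */
    static String decode(byte[] data) {
        try {
            // REPORT makes the decoder throw instead of substituting U+FFFD.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data))
                    .toString();
        } catch (CharacterCodingException e) {
            // Latin-1 maps every byte to a character, so this never fails.
            return new String(data, StandardCharsets.ISO_8859_1);
        }
    }
}
```

For example, the bytes 0x5A 0xFC 0x72 are not valid UTF-8 and decode as three Latin-1 characters, whereas 0x5A 0xC3 0xBC decode as two characters via UTF-8.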

getStringAsLatin1

public java.lang.String getStringAsLatin1()
Returns a string representation of the byte data where the encoding is assumed to be Latin-1. Bytes making up any UTF-8 characters will appear as sequences of unrelated Latin-1 characters.

Returns:
A string representation of the byte data where the encoding is assumed to be Latin-1.

getStringAsUtf8

public java.lang.String getStringAsUtf8()
Returns a string representation of the byte data where the encoding is assumed to be UTF-8.

Returns:
A string representation of the byte data where the encoding is assumed to be UTF-8.

getStringAsUtf16

public java.lang.String getStringAsUtf16()
Returns a string representation of the byte data where the encoding is assumed to be UTF-16 (equivalent to UCS-2 for characters in the Basic Multilingual Plane).

Returns:
A string representation of the byte data where the encoding is assumed to be UTF-16.

getDataClass

public int getDataClass()
Returns a classification for the byte array as a whole. The code returned will be one of the DATA_CLASS_* constants specified in this class. Note that the presence of illegal characters will always cause the data class to be DATA_CLASS_CONTAINS_ILLEGAL regardless of what other characters are in the string.

Returns:
A classification for the byte array as a whole.
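
One way an overall classification could be derived from per-byte classifications is sketched below. Everything in it is an assumption for illustration: the enums stand in for the BYTE_CLASS_* and DATA_CLASS_* constants (whose actual values are not given here), the two ambiguous byte classes are merged into one, and the rule that any two distinct non-ASCII classes yield CONTAINS_MULTIPLE is a guess. Only the precedence of illegal bytes is taken from the description above.

```java
import java.util.EnumSet;

final class DataClassSketch {
    enum ByteClass { ASCII, LATIN_1, TWO_BYTE_UTF_8, THREE_BYTE_UTF_8, AMBIGUOUS, ILLEGAL }

    enum DataClass {
        ASCII, ASCII_AND_LATIN_1, ASCII_AND_2_BYTE_UTF_8, ASCII_AND_3_BYTE_UTF_8,
        CONTAINS_AMBIGUOUS, CONTAINS_MULTIPLE, CONTAINS_ILLEGAL
    }

    /** Summarises per-byte classes into a single class for the whole array. */
    static DataClass summarise(ByteClass[] classes) {
        EnumSet<ByteClass> seen = EnumSet.noneOf(ByteClass.class);
        for (ByteClass c : classes) {
            seen.add(c);
        }
        // Illegal bytes override every other classification.
        if (seen.contains(ByteClass.ILLEGAL)) {
            return DataClass.CONTAINS_ILLEGAL;
        }
        seen.remove(ByteClass.ASCII);
        if (seen.isEmpty()) {
            return DataClass.ASCII;
        }
        if (seen.size() > 1) {
            return DataClass.CONTAINS_MULTIPLE; // assumed rule, see lead-in
        }
        switch (seen.iterator().next()) {
            case LATIN_1:          return DataClass.ASCII_AND_LATIN_1;
            case TWO_BYTE_UTF_8:   return DataClass.ASCII_AND_2_BYTE_UTF_8;
            case THREE_BYTE_UTF_8: return DataClass.ASCII_AND_3_BYTE_UTF_8;
            default:               return DataClass.CONTAINS_AMBIGUOUS;
        }
    }
}
```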

utf8Oversize

public boolean utf8Oversize()
Returns true if the UTF-8 representation of this string would overflow the column it is in (presumably the string is currently encoded using Latin-1). The column size is set at initialisation.

Returns:
True if the UTF-8 representation of this string would overflow the column it is in.
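
The size check can be illustrated with simple arithmetic: when Latin-1 data is re-encoded as UTF-8, each byte in 0x00 - 0x7F stays one byte and each byte in 0x80 - 0xFF becomes two. The sketch below assumes the data is currently Latin-1 and that "overflow" means strictly more bytes than the column allows; the class name Utf8SizeSketch and its methods are hypothetical, and the actual method may compute this differently.

```java
final class Utf8SizeSketch {
    /** Number of bytes needed to re-encode the given Latin-1 data as UTF-8. */
    static int utf8LengthOfLatin1(byte[] latin1) {
        int length = 0;
        for (byte b : latin1) {
            // Java bytes are signed: 0x00-0x7F are >= 0, 0x80-0xFF are negative.
            length += (b >= 0) ? 1 : 2;
        }
        return length;
    }

    /** True if the UTF-8 form would need more than numBytesAllowed bytes. */
    static boolean wouldOverflow(byte[] latin1, int numBytesAllowed) {
        return utf8LengthOfLatin1(latin1) > numBytesAllowed;
    }
}
```

For instance, "Zürich" occupies 6 bytes in Latin-1 (0x5A 0xFC 0x72 0x69 0x63 0x68) but 7 bytes in UTF-8, so it would overflow a 6 byte column after conversion.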

compareTo

public int compareTo(java.lang.Object otherObj)
Compares this instance to another object, returning an integer that can be used to sort an array of DataHighBitAnalysis instances based on a case-insensitive comparison of the string returned by getString().

Specified by:
compareTo in interface java.lang.Comparable
Parameters:
otherObj - The object to compare this instance to.
Returns:
An integer that can be used to sort an array of DataHighBitAnalysis instances.
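
The sorting behaviour described above can be sketched with a stand-in class. Item below is hypothetical and holds only a string, in place of the value returned by getString(); it uses the standard String.compareToIgnoreCase method, which may or may not be what this class uses internally.

```java
import java.util.Arrays;

final class CaseInsensitiveSortSketch {
    /** Hypothetical stand-in for an analysis object; only its string value matters. */
    static final class Item implements Comparable<Item> {
        final String value;

        Item(String value) {
            this.value = value;
        }

        @Override
        public int compareTo(Item other) {
            // Case-insensitive ordering of the string values.
            return value.compareToIgnoreCase(other.value);
        }
    }

    /** Wraps the values in Items, sorts them, and returns the sorted values. */
    static String[] sortedValues(String... values) {
        Item[] items = new Item[values.length];
        for (int i = 0; i < values.length; i++) {
            items[i] = new Item(values[i]);
        }
        Arrays.sort(items); // uses Item.compareTo
        String[] out = new String[items.length];
        for (int i = 0; i < items.length; i++) {
            out[i] = items[i].value;
        }
        return out;
    }
}
```

With this ordering, "cherry", "Banana", "apple" sort to "apple", "Banana", "cherry"; a case-sensitive sort would place "Banana" first.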

equals

public boolean equals(java.lang.Object otherObj)
Returns true if the specified object is a DataHighBitAnalysis instance with the same value (what is returned by getString()) as the current instance.

Parameters:
otherObj - The object to compare this instance to.
Returns:
True if the specified object is a DataHighBitAnalysis instance with the same value as the current instance.

hashCode

public int hashCode()
Returns a hashcode for the current instance based on the current string value (as returned by getString()).

Returns:
A hashcode for the current instance.

toString

public java.lang.String toString()
Returns a text description of the current instance that can be used for debugging purposes.

Returns:
A text description of the current instance that can be used for debugging purposes.