com.aliasi.chunk
Class ChunkingImpl

java.lang.Object
  extended by com.aliasi.chunk.ChunkingImpl
All Implemented Interfaces:
Chunking, Iterable<Chunk>

public class ChunkingImpl
extends Object
implements Chunking, Iterable<Chunk>

A ChunkingImpl provides a mutable, set-based implementation of the chunking interface. At construction time, a character sequence or slice is specified. Chunks may then be added using the add(Chunk) method.

Since:
LingPipe2.1
Version:
3.9
Author:
Bob Carpenter

Constructor Summary
ChunkingImpl(char[] cs, int start, int end)
          Construct a chunking implementation to hold chunks over the specified character slice.
ChunkingImpl(CharSequence cSeq)
          Constructs a chunking implementation to hold chunks over the specified character sequence.
 
Method Summary
 void add(Chunk chunk)
          Add a chunk to this chunking.
 void addAll(Collection<Chunk> chunks)
          Adds all of the chunks in the specified collection to this chunking.
 CharSequence charSequence()
          Returns the character sequence underlying this chunking.
 Set<Chunk> chunkSet()
          Return an unmodifiable view of the set of chunks for this chunking.
static boolean equal(Chunking chunking1, Chunking chunking2)
          Returns true if the specified chunkings are equal.
 boolean equals(Object that)
          Returns true if the specified object is a chunking equal to this one.
 int hashCode()
          Returns the hash code for this chunking.
static int hashCode(Chunking chunking)
          Returns the hash code for the specified chunking.
 Iterator<Chunk> iterator()
          Returns an unmodifiable iterator over the chunk set underlying this chunking implementation.
static Chunking merge(Chunking chunking1, Chunking chunking2)
          Return the result of combining two chunkings into a single non-overlapping chunking.
static boolean overlap(Chunk chunk1, Chunk chunk2)
          Returns true if the chunks overlap at least one character position.
 String toString()
          Returns a string-based representation of this chunking.
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

ChunkingImpl

public ChunkingImpl(CharSequence cSeq)
Constructs a chunking implementation to hold chunks over the specified character sequence. The sequence is stored immutably in this implementation, so later changes to the sequence provided to this constructor will not affect the constructed chunking implementation. All chunks added must be within this character sequence's bounds.

Parameters:
cSeq - Character sequence underlying the chunking.

ChunkingImpl

public ChunkingImpl(char[] cs,
                    int start,
                    int end)
Construct a chunking implementation to hold chunks over the specified character slice. The slice is copied, so later changes to it do not affect the constructed chunking. All chunks added to this chunking must be within this character slice's (relative) bounds. The chunks themselves will have indices relative to the start parameter of this constructor, rather than absolute offsets into the character array.

Parameters:
cs - Character array.
start - Index in array of first element in slice.
end - Index in array of one past the last element in slice.
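
The relative indexing can be illustrated with a minimal sketch in plain Java that does not use LingPipe. The class SliceOffsets and its methods are hypothetical names chosen for this example. It copies the slice as the constructor is documented to do, and shows that a chunk with relative indices 0 to 4 covers array positions start to start + 4.

```java
// Hypothetical illustration only; not part of LingPipe.
final class SliceOffsets {

    // Copies cs[start, end) into an immutable string, mirroring the
    // documented copy made by the slice constructor.
    static String copySlice(char[] cs, int start, int end) {
        return new String(cs, start, end - start);
    }

    // Text covered by a chunk whose indices are relative to the slice
    // start, not absolute offsets into the original array.
    static String chunkText(String slice, int chunkStart, int chunkEnd) {
        return slice.substring(chunkStart, chunkEnd);
    }
}
```

For an array holding "xxJohn ranxx" with start 2 and end 10, the copied slice is "John ran", and the relative span 0 to 4 selects "John" even though it sits at array positions 2 to 6.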
Method Detail

addAll

public void addAll(Collection<Chunk> chunks)
Adds all of the chunks in the specified collection to this chunking. If any of the chunks do not implement the Chunk interface, an illegal argument exception is thrown.

Parameters:
chunks - Chunks to add to this chunking.
Throws:
IllegalArgumentException - If the collection contains an object that does not implement Chunk.

iterator

public Iterator<Chunk> iterator()
Returns an unmodifiable iterator over the chunk set underlying this chunking implementation. The chunks will be iterated in the order in which they were added to this implementation.

Specified by:
iterator in interface Iterable<Chunk>
Returns:
Unmodifiable iterator over the set of chunks.

add

public void add(Chunk chunk)
Add a chunk to this chunking. The chunk must have start and end points within the bounds provided by the character sequence underlying this chunking.

Parameters:
chunk - Chunk to add to this chunking.
Throws:
IllegalArgumentException - If the end point is beyond the underlying character sequence.

charSequence

public CharSequence charSequence()
Returns the character sequence underlying this chunking.

Specified by:
charSequence in interface Chunking
Returns:
The character sequence underlying this chunking.

chunkSet

public Set<Chunk> chunkSet()
Return an unmodifiable view of the set of chunks for this chunking. The chunk set will iterate elements in the order in which they were added to the chunking.

Specified by:
chunkSet in interface Chunking
Returns:
The set of chunks for this chunking.
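
The two documented properties, insertion-order iteration and an unmodifiable view, can be obtained together with standard collections. The sketch below shows one possible way to do it, a LinkedHashSet wrapped by Collections.unmodifiableSet. This is an assumption for illustration, not a description of LingPipe's internals; the class OrderedView is hypothetical, and strings stand in for chunks.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration; strings stand in for chunks.
final class OrderedView {

    // LinkedHashSet iterates in insertion order and rejects duplicates.
    private final Set<String> items = new LinkedHashSet<>();

    void add(String item) {
        items.add(item);
    }

    // Read-only view: it reflects later additions made through add(),
    // but mutating calls on the view itself throw
    // UnsupportedOperationException.
    Set<String> view() {
        return Collections.unmodifiableSet(items);
    }
}
```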

equals

public boolean equals(Object that)
Description copied from interface: Chunking
Returns true if the specified object is a chunking equal to this one. Equality for chunking is defined by character sequence yield equality and chunk set equality. Character sequences are tested for equality with Strings.equalCharSequence(CharSequence,CharSequence) and chunks are compared as sets with elements tested for equality using Chunk.equals(Object). There is a utility implementation of this definition provided for chunkings in equal(Chunking,Chunking).

Specified by:
equals in interface Chunking
Overrides:
equals in class Object
Parameters:
that - Object to compare.
Returns:
true if the specified object is a chunking equal to this one.

hashCode

public int hashCode()
Description copied from interface: Chunking
Returns the hash code for this chunking. Hash codes for chunkings are defined by:
 hashCode() 
   = Strings.hashCode(charSequence())
     + 31 * chunkSet().hashCode()
 
There is a utility implementation of this definition provided for chunkings in hashCode(Chunking).

Specified by:
hashCode in interface Chunking
Overrides:
hashCode in class Object
Returns:
The hash code for this chunking.
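
The formula above can be written out as a minimal plain-Java sketch. The class ChunkingHash is hypothetical, and the text's String.hashCode() is used as a stand-in for Strings.hashCode(CharSequence); that equivalence is an assumption of this example.

```java
import java.util.Set;

// Hypothetical illustration of the documented formula; not LingPipe source.
final class ChunkingHash {

    // Assumption: String.hashCode() of the sequence's text stands in for
    // Strings.hashCode(CharSequence).
    static int hash(CharSequence cSeq, Set<?> chunkSet) {
        return cSeq.toString().hashCode() + 31 * chunkSet.hashCode();
    }
}
```

Because the hash depends only on the text and the chunk set, two chunkings backed by different CharSequence implementations (for instance a String and a StringBuilder) with the same text and equal chunk sets receive the same hash code under this sketch, which matches the equality definition.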

toString

public String toString()
Returns a string-based representation of this chunking. This representation includes the character sequence and each chunk in the chunk set.

Overrides:
toString in class Object
Returns:
String-based representation of this chunking.

equal

public static boolean equal(Chunking chunking1,
                            Chunking chunking2)
Returns true if the specified chunkings are equal. Chunking equality is defined in Chunking.equals(Object) to be equality of character sequence yields and equality of chunk sets.

Warning: Equality is unstable if the chunkings change.

Parameters:
chunking1 - First chunking.
chunking2 - Second chunking.
Returns:
true if the chunkings are equal.

hashCode

public static int hashCode(Chunking chunking)
Returns the hash code for the specified chunking. The hash code for a chunking is defined by Chunking.hashCode().

Warning: Hash codes are unstable if the chunkings change.

Parameters:
chunking - Chunking whose hash code is returned.
Returns:
The hash code for the specified chunking.

overlap

public static boolean overlap(Chunk chunk1,
                              Chunk chunk2)
Returns true if the chunks overlap at least one character position.

Chunks chunk1 and chunk2 overlap if

 chunk1.start() <= chunk2.start() < chunk1.end()
or
 chunk2.start() <= chunk1.start() < chunk2.end()

Parameters:
chunk1 - First chunk to test.
chunk2 - Second chunk to test.
Returns:
true if the chunks overlap at least one character position.
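
The condition above can be checked on raw offsets with a minimal plain-Java sketch. The class SpanOverlap is hypothetical and takes start and end integers directly, whereas the LingPipe method takes Chunk arguments. End offsets are treated as exclusive (one past the last character).

```java
// Hypothetical illustration on raw offsets; not LingPipe source.
final class SpanOverlap {

    // True if the half-open spans [start1, end1) and [start2, end2)
    // satisfy the documented condition, i.e. share at least one position.
    static boolean overlap(int start1, int end1, int start2, int end2) {
        return (start1 <= start2 && start2 < end1)
            || (start2 <= start1 && start1 < end2);
    }
}
```

Under this definition, spans 0 to 4 and 3 to 6 overlap (they share position 3), while spans 0 to 4 and 4 to 6 do not, because the first span ends before position 4.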

merge

public static Chunking merge(Chunking chunking1,
                             Chunking chunking2)
Return the result of combining two chunkings into a single non-overlapping chunking. Chunks in the first chunking are sorted using Chunk.TEXT_ORDER_COMPARATOR and then visited left to right, keeping each chunk that does not overlap a chunk kept earlier in the order. Chunks from the second chunking are then processed in the same way: they are sorted, and each chunk that is consistent with the chunks already kept is added in order.

The returned chunking has a string as its character sequence rather than a copy of either input chunking's character sequence.

Overall, this is an O(n log n) operation because of the sorting. It also allocates an array for each input chunking's chunks, as well as the string and the chunk set for the result.

Parameters:
chunking1 - First chunking to combine.
chunking2 - Second chunking to combine.
Returns:
Combination of the two chunkings.
Throws:
IllegalArgumentException - If the chunkings are not over the same character sequence.
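
The procedure described above can be sketched in plain Java without LingPipe. Everything in the block is hypothetical: the class MergeSketch, the use of int[]{start, end} pairs in place of chunks, and the assumption that text order means ascending start offset with ties broken by end offset. For brevity the sketch checks each candidate against every kept span, which is quadratic and does not reproduce the stated O(n log n) bound.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of the described procedure; not LingPipe source.
final class MergeSketch {

    // Assumption: text order is by start offset, ties broken by end offset.
    static final Comparator<int[]> TEXT_ORDER = (a, b) ->
        a[0] != b[0] ? Integer.compare(a[0], b[0]) : Integer.compare(a[1], b[1]);

    // Half-open spans {start, end} overlap if they share a position.
    static boolean overlap(int[] a, int[] b) {
        return (a[0] <= b[0] && b[0] < a[1]) || (b[0] <= a[0] && a[0] < b[1]);
    }

    // Sorts the candidates, then keeps each one that overlaps nothing
    // kept so far (simple quadratic check).
    static void addConsistent(List<int[]> kept, List<int[]> candidates) {
        List<int[]> sorted = new ArrayList<>(candidates);
        sorted.sort(TEXT_ORDER);
        for (int[] candidate : sorted) {
            boolean clash = false;
            for (int[] k : kept) {
                if (overlap(candidate, k)) {
                    clash = true;
                    break;
                }
            }
            if (!clash) {
                kept.add(candidate);
            }
        }
    }

    // The first input takes priority; the second only fills the gaps.
    static List<int[]> merge(List<int[]> first, List<int[]> second) {
        List<int[]> kept = new ArrayList<>();
        addConsistent(kept, first);
        addConsistent(kept, second);
        return kept;
    }
}
```

The design point to notice is the asymmetry: a chunk from the second input is dropped whenever it overlaps anything already kept from the first, so swapping the arguments can change the result.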