dClass - Pattern Classification Engine

dClass is an indexed pattern classification engine. Unlike a search index, which
indexes your data and runs queries over it, dClass indexes your queries
(patterns) and runs data over them, classifying the data against the patterns.

dClass performs pattern classification in near constant time with respect to
index size, quickly and accurately finding the best matching pattern for a
given input. For an input of size M classified against an index of size N,
dClass has worst case O(M) performance, even for large values of N. To
accomplish this, dClass uses a dtree, a multi-edge aliased network of
sub-pattern nodes. This structure is heavily optimized for searching,
retrieval, and high performance on modern CPUs (classification runtime is in
the range of 5 microseconds on modern hardware).
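
To illustrate why lookup cost can depend on the input length M rather than on
the number of indexed patterns N, here is a minimal sketch using a plain
character trie. This is not the dtree structure, which is considerably more
elaborate. All names (toy_node, toy_add, toy_lookup, toy_free_children) are
hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical toy trie for illustration only; the real dtree differs. */
#define TOY_FANOUT 36 /* slots for a-z and 0-9 */

typedef struct toy_node {
    struct toy_node *child[TOY_FANOUT];
    const char *id; /* non-NULL when a pattern ends at this node */
} toy_node;

/* Map an ASCII alphanumeric to a slot, ignoring case; -1 for anything else. */
static int toy_slot(char c) {
    if (c >= '0' && c <= '9') return 26 + (c - '0');
    if (c >= 'a' && c <= 'z') return c - 'a';
    if (c >= 'A' && c <= 'Z') return c - 'A';
    return -1;
}

/* Index one pattern: one node per character, created on demand. */
static int toy_add(toy_node *root, const char *pattern, const char *id) {
    assert(root && pattern && id);
    toy_node *n = root;
    for (const char *p = pattern; *p; p++) {
        int s = toy_slot(*p);
        if (s < 0) return -1; /* character not supported by this sketch */
        if (!n->child[s]) {
            n->child[s] = calloc(1, sizeof(toy_node));
            if (!n->child[s]) return -1;
        }
        n = n->child[s];
    }
    n->id = id;
    return 0;
}

/* Look up one token: at most one array access per input character,
   no matter how many patterns were indexed. */
static const char *toy_lookup(const toy_node *root, const char *token) {
    assert(root && token);
    const toy_node *n = root;
    for (const char *p = token; *p && n; p++) {
        int s = toy_slot(*p);
        if (s < 0) return NULL;
        n = n->child[s];
    }
    return n ? n->id : NULL;
}

/* Free every node below n; n itself is owned by the caller. */
static void toy_free_children(toy_node *n) {
    for (int i = 0; i < TOY_FANOUT; i++) {
        if (n->child[i]) {
            toy_free_children(n->child[i]);
            free(n->child[i]);
            n->child[i] = NULL;
        }
    }
}
```

The loop in toy_lookup runs once per input character, so adding more patterns
grows memory use but not the number of steps taken for a given input.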

dClass introduces several classification pattern types: STRONG, CHAIN, WEAK,
and NONE. These types can be coupled with regular expressions, absolute
positioning, grouping, inheritance, duplication, ranking, and directional
proximity. This allows for an expressive index which is capable of handling
complex context aware patterns like language as well as simpler pattern
classifications like device detection, all under a unified syntax and API.

dClass is built in a modular fashion and allows for schema free data modeling.
This means that multiple pattern indexes can be combined with their own custom
classification language allowing for networked knowledge based classification
while retaining near constant time performance.

Please see the webClass project to see dClass in action:

https://github.com/rezan/webClass


 PATTERNS

dClass can load its patterns from a .dtree file or from a DDR xml directory.
Patterns can also be added directly to the index via the dtree C API. The test
client allows for the conversion of a DDR xml directory into a .dtree file and
an API exists to dump the current index into a .dtree file.

Please see the README in the dtrees directory for detailed dtree pattern notes,
examples, and tips:

https://github.com/TheWeatherChannel/dClass/blob/master/dtrees/README


 AUTHORS

Reza Naghibi

Special thanks:
OpenDDR team, Anthony Watson, Eric Honer, Joe Pearson, Luke Kolin,
Ivan Kozhuharov, Chris Hill, Chris McClellen, and The Weather Channel.


 DCLASS VS GREP

https://github.com/TheWeatherChannel/dClass/wiki/dClass-vs-grep


 DEVICE MAP (OPENDDR)

This project will track DeviceMap updates with its own patches. All DeviceMap
updates will be backwards compatible with dClass 2.0 code.


 HOWTO

To compile the test client, run make in the src directory.

To build with varnish or nginx, please reference the READMEs in the varnish
and nginx servers subdirectories.

To integrate with the dClass API:

  -include the dClass header file:
    #include "dclass_client.h"

  -define a dclass_index:
    dclass_index dci;

  -populate the index using a dtree file or DeviceMap resource file:
    dclass_load_file(&dci,"/path/to/file.dtree");
    -OR-
    openddr_load_resources(&dci,"/path/to/devicemap/devicedata");

  -classify a string against the index and get the resulting kv data:
    dclass_keyvalue *kv=dclass_classify(&dci,"this is a string");
    char *id=kv->id;
    char *field_xyz=dclass_get_kvalue(kv,"xyz");

  -free the index:
    dclass_free(&dci);


 ROADMAP

Enhancements for 2.4

  -Configurable v16 bit, v32 bit, and native addressing per index
  -Move certain index settings from global to configurable per index
  -Better support for Unicode [1]

Enhancements for 3.0

  -Realtime additions, modifications, and deletions
  -Expanded regex support


 JAVA

dClass supports Java via a native JNI extension. A custom JNI loader is used.
It first attempts to load a system shared object (dclassjava). If that fails,
it then attempts to load a locally packaged shared object. A precompiled jar
is included at java/dist/dclass.jar, which comes packaged with 32-bit and
64-bit shared objects for Windows, Linux, and OS X.


 NOTES

All US-ASCII alphanumeric characters are pattern searchable. Non-alphanumeric
pattern searchable characters are defined in DTREE_HASH_SCHARS; these act as
word separators. Indexed US-ASCII printable characters (0x20 through 0x7E)
which aren't pattern searchable are replaced with DTREE_PATTERN_ANY and can
match any character. All pattern matching is US-ASCII case insensitive.
Extended non-separator character set recognition is supported via
DTREE_HASH_TCHARS [1].
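
The character rules above can be sketched as a small normalization pass over a
pattern. The separator set and the wildcard character below are assumptions
chosen for the example. The actual contents of DTREE_HASH_SCHARS and the value
of DTREE_PATTERN_ANY are not given here, and the norm_* names are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define NORM_SCHARS " -_/" /* assumed separators, NOT the real DTREE_HASH_SCHARS */
#define NORM_ANY '?'       /* assumed stand-in for DTREE_PATTERN_ANY */

typedef enum {
    NORM_SEARCHABLE, /* US-ASCII alphanumeric */
    NORM_SEPARATOR,  /* searchable non-alphanumeric, acts as a word separator */
    NORM_WILDCARD,   /* other printable US-ASCII (0x20 through 0x7E) */
    NORM_OTHER       /* everything else, left untouched by this sketch */
} norm_kind;

/* Classify a single character according to the rules described above. */
static norm_kind norm_class(char c) {
    if ((c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') ||
        (c >= 'A' && c <= 'Z'))
        return NORM_SEARCHABLE;
    if (c != '\0' && strchr(NORM_SCHARS, c))
        return NORM_SEPARATOR;
    if (c >= 0x20 && c <= 0x7E)
        return NORM_WILDCARD;
    return NORM_OTHER;
}

/* Normalize a pattern in place: lowercase the alphanumerics, keep the
   separators, and replace other printable characters with the wildcard. */
static void norm_pattern(char *pattern) {
    assert(pattern);
    for (char *p = pattern; *p; p++) {
        switch (norm_class(*p)) {
        case NORM_SEARCHABLE:
            if (*p >= 'A' && *p <= 'Z') *p = (char)(*p - 'A' + 'a');
            break;
        case NORM_WILDCARD:
            *p = NORM_ANY;
            break;
        default:
            break;
        }
    }
}
```

Under these assumed settings, the pattern "Mozilla/5.0 (X11)" normalizes to
"mozilla/5?0 ?x11?": letters are lowercased, '/' and the space survive as
separators, and '.', '(' and ')' become wildcards.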

Write operations on the index are not thread safe. Read operations are thread
safe (with at most one writer). Read operations have the dclass index parameter
designated with a 'const'.

Memory limits are tightly bounded. The default 16-bit configuration allows for
65k search nodes and 64MB of general memory. Adjusting DTREE_DT_PACKED* allows
for more search nodes, and increasing DTREE_M_MAX_SLABS allows for more general
use memory. Since the dtree data structure is pointer heavy, pointers can
optionally be compressed down into 16-bit or 32-bit values.
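
As a rough illustration of pointer compression, the sketch below stores nodes
in one contiguous pool and links them by 16-bit index instead of by native
pointer. It shows where a 65k node ceiling comes from (2^16 addressable
slots). The layout and the pk_* names are hypothetical and are not the dtree's
actual packed representation. Reserving slot 0 as a null reference is also an
assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint16_t pk_ref;     /* compressed "pointer": an index into the pool */
#define PK_NULL_REF 0        /* slot 0 is reserved to mean "no node" */
#define PK_MAX_NODES 65536u  /* 2^16 slots addressable with 16 bits */

typedef struct {
    char data;
    pk_ref next; /* 2 bytes instead of a 4 or 8 byte native pointer */
} pk_node;

typedef struct {
    pk_node *nodes;
    uint32_t used; /* next free slot; starts at 1 because 0 is the null ref */
} pk_pool;

/* Allocate the whole pool up front; returns 0 on success, -1 on failure. */
static int pk_pool_init(pk_pool *pool) {
    assert(pool);
    pool->nodes = calloc(PK_MAX_NODES, sizeof(pk_node));
    pool->used = 1;
    return pool->nodes ? 0 : -1;
}

/* Hand out the next slot, or PK_NULL_REF once all slots are in use. */
static pk_ref pk_alloc(pk_pool *pool, char data) {
    assert(pool && pool->nodes);
    if (pool->used >= PK_MAX_NODES) return PK_NULL_REF;
    pk_ref r = (pk_ref)pool->used++;
    pool->nodes[r].data = data;
    pool->nodes[r].next = PK_NULL_REF;
    return r;
}

/* Turn a compressed reference back into a real pointer. */
static pk_node *pk_deref(pk_pool *pool, pk_ref r) {
    return r == PK_NULL_REF ? NULL : &pool->nodes[r];
}

static void pk_pool_free(pk_pool *pool) {
    free(pool->nodes);
    pool->nodes = NULL;
    pool->used = 1;
}
```

The trade-off is one extra addition per dereference in exchange for smaller
nodes, which lets more of the structure fit in CPU cache.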

[1] Unicode can be supported by adding '%' to DTREE_HASH_TCHARS and using
    percent encoding on your data.
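
A minimal sketch of the percent encoding step mentioned in [1], applied to
UTF-8 input before classification. The function name pct_encode is
hypothetical and is not part of the dClass API. Encoding '%' itself and using
uppercase hex digits are choices made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Percent-encode bytes outside US-ASCII (and '%' itself) from src into dst.
   Returns the encoded length, or -1 if dst (capacity cap) is too small. */
static long pct_encode(const char *src, char *dst, size_t cap) {
    static const char hex[] = "0123456789ABCDEF";
    size_t o = 0;
    assert(src && dst);
    for (const unsigned char *p = (const unsigned char *)src; *p; p++) {
        if (*p >= 0x80 || *p == '%') {
            if (o + 3 >= cap) return -1; /* need 3 bytes plus terminator */
            dst[o++] = '%';
            dst[o++] = hex[*p >> 4];
            dst[o++] = hex[*p & 0x0F];
        } else {
            if (o + 1 >= cap) return -1; /* need 1 byte plus terminator */
            dst[o++] = (char)*p;
        }
    }
    if (o >= cap) return -1;
    dst[o] = '\0';
    return (long)o;
}
```

For example, the UTF-8 bytes for "café" (0x63 0x61 0x66 0xC3 0xA9) encode to
"caf%C3%A9". Patterns would need to be written in the same encoded form for
the data to match them.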