Ent - Plotting entropy and computing string metrics

Introduction:
Looking at a statistical plot of a metric such as the entropy of a file or of a large data stream/string is often the fastest way to roughly estimate its contents.

Ent is a command-line utility that reads data from standard input (stdin), for example through pipes, and/or from files, which have to be passed as the first parameters. Ent calculates the metrics of each individual input and then plots the cumulative metrics obtained by sequentially concatenating the data. Because the inputs are concatenated, the order of the files on the command line matters.
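The idea of a cumulative metric over concatenated inputs can be sketched in a few lines of Python. This is only an illustration under assumptions, not Ent's actual C# code: the function names `shannon_entropy` and `cumulative_entropy` are made up here, and reading "cumulative" as "entropy of the growing prefix of the concatenated stream" is one plausible interpretation of the description above.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte; 0.0 for empty input."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def cumulative_entropy(chunks, steps=10):
    """Entropy of the growing prefix of the concatenated chunks.

    Hypothetical helper: samples the prefix at `steps` evenly spaced
    end positions, which is an assumption about how a cumulative plot
    could be produced, not a description of Ent's internals.
    """
    stream = b"".join(chunks)
    points = []
    for i in range(1, steps + 1):
        end = len(stream) * i // steps
        points.append(shannon_entropy(stream[:end]))
    return points


if __name__ == "__main__":
    # Same data, different order: the curves differ until the very end.
    print(cumulative_entropy([b"aaaa", b"abab"], steps=2))
    print(cumulative_entropy([b"abab", b"aaaa"], steps=2))
```

Both calls end at the same value because the full stream contains the same bytes, but the intermediate points differ, which is why the file order changes the shape of the plot.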

Ent with a text file containing ASCII graphics. Regions with text and regions with graphics can be discerned.

Ent using several files. The plot encompasses the cumulative data.


When no parameters are passed, or when '-h' or '?' is passed, the following command-line usage is displayed:


╬===================================================================╬
| Ent v09/2011 Entropy calculator & string metrics; lsauer - univie |
╬===================================================================╬


Usage: shantropy [<filename1> <filename2>...first params!] [-o <outfile>] [-h help]
[-e efficiency] [-m 1,2.. 1st,2nd order markov] [-b base-alphabet]
[-w width plot] [-h height plot] [-s <string> as last param!]



Parameters:
-e ...plot the efficiency of the data
-w <int> ...set the width of the plot
-h <int> ...set the height of the plot
-o <outfile> ...plot the data to the given file, creating it or appending to it;
   use '> myfilename.out' to capture the entire stdout instead
-m <int> ...first / n-th order Markov source (n is the number of linked characters)
-b <decimal> ..."b-ary entropy": choose a different base; default is 256 for ASCII; use 64 for text
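What the -b, -e and -m options measure can be illustrated with a short Python sketch. It rests on textbook definitions and is not taken from Ent's source: the names `entropy`, `efficiency` and `markov_entropy` are hypothetical, efficiency is assumed to mean entropy divided by the maximum entropy for the observed alphabet, and the Markov option is assumed to mean the conditional entropy of a symbol given the `order` symbols before it.

```python
import math
from collections import Counter


def entropy(symbols, base=2.0):
    """b-ary Shannon entropy of a sequence of hashable symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total, base) for c in counts.values())


def efficiency(data):
    """Entropy relative to its maximum for the observed alphabet (assumed definition)."""
    k = len(set(data))
    if k < 2:
        return 0.0  # a single-symbol alphabet carries no information
    return entropy(data, base=k)


def markov_entropy(data, order=1, base=2.0):
    """Conditional entropy H(next symbol | previous `order` symbols).

    Computed as H(context + symbol) - H(context) over all overlapping
    n-grams; this is one common estimator, assumed here for illustration.
    """
    grams = [data[i:i + order + 1] for i in range(len(data) - order)]
    joint = entropy(grams, base)
    context = entropy((g[:order] for g in grams), base)
    return joint - context


if __name__ == "__main__":
    sample = b"abababab"
    print(entropy(sample, base=2))           # zeroth-order entropy in bits
    print(entropy(sample, base=256))         # same data, base 256
    print(efficiency(sample))                # relative to a two-symbol alphabet
    print(markov_entropy(sample, order=1))   # next symbol is fully predictable
```

With base 256, a byte stream that uses every byte value equally often scores 1, so the plotted values fall between 0 and 1. A strictly alternating sequence has high zeroth-order entropy but a first-order conditional entropy of zero, which is the kind of structure a higher Markov order can expose.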

The most up to date information on usage is best accessed via the program's source header:
https://github.com/lsauer/entropy/blob/master/shantropy/Main.cs

Usage examples:

  • ent explain.nfo markdownsharp-20100703-v113.7z -b 3,6
  • cat myfile | grep '\d+.+?\d' | ent  -b 2.15 -s
  • ent .\json_testfiles.tar .\cc_by_sa.png .\vmdscene.bmp -e -w 80 -h 12 -m 1
  • ent json_testfiles.tar cc_by_sa.png vmdscene.bmp -e -w 80 -h 12 -m 1


About:

The purpose of this command-line tool was to quickly analyze sequence files as well as serialized JSON data (at the time of writing the application). Since the command line, along with autocompletion (by pressing Tab), is often faster than graphical tools and has virtually no startup time, it is my preferred choice for analyzing data.

The tool is written in C# and comes with cross-platform build files along with project files for MonoDevelop.
Ent runs on Linux, Mac OS, Android and Windows.
See Mono for more information.

Notes:
The code is in disarray and will likely be rewritten after feature completeness is reached. Consider the tool an early alpha build, far from the intended feature set. In bioinformatics, many situations arise in which the command line is a powerful and helpful way of quickly interacting with data, without having to learn complex user interfaces with unfamiliar designs.

Ent can quickly provide insight into the nature of a data stream or text-based file, e.g. its structure, even hierarchy, and whether it is compressed or encrypted, as one becomes familiar with certain patterns.
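The kind of pattern meant here can be reproduced with a small Python sketch. It is a stand-in under assumptions rather than Ent's own method: `windowed_entropy` is a hypothetical helper that measures fixed, non-overlapping windows, and seeded pseudo-random bytes stand in for compressed or encrypted data.

```python
import math
import random
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte; 0.0 for empty input."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def windowed_entropy(data: bytes, window: int = 256):
    """Entropy of consecutive non-overlapping windows (hypothetical helper)."""
    return [
        shannon_entropy(data[i:i + window])
        for i in range(0, len(data) - window + 1, window)
    ]


# 1024 bytes of plain text followed by 1024 pseudo-random bytes.
rng = random.Random(0)
text = (b"the quick brown fox jumps over the lazy dog " * 24)[:1024]
noise = bytes(rng.randrange(256) for _ in range(1024))

profile = windowed_entropy(text + noise, window=256)

# Crude text-mode bar chart: one row per window, bar length ~ entropy.
for index, value in enumerate(profile):
    print(f"{index * 256:5d} {value:4.2f} " + "#" * round(value * 5))
```

The first four rows (text) show clearly shorter bars than the last four (noise), so the boundary between the two regions is visible at a glance, much like the text and graphics regions in the captioned plot above.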

The project can be downloaded or forked here:
https://github.com/lsauer/entropy

Installing:
To access the program from anywhere on the command line, add its directory to $PATH (Linux) or PATH (Windows).

Download:



Next version:  

  • png plot output
  • grace plotting script output (end of '11)
  • PE (portable executable) header inspection with FPU opcode density (based on the byte stream from D8 to DF - Windows!) (eventually)
  • Simpson's D and E (diversity index and evenness) (end of the month)
  • Smith–Waterman Distance (soon)
  • Hamming distance (soon)
  • eventually Soundex
  • automatic text-mode toggling based on a score (alphabet size, file extension and text characteristics) 
  • option of data range selection (soon)

Resources:


I conclude with a plot of the augustus library for Chlamydomonas reinhardtii, which is notoriously highly entropic and additionally non-redundant. You can get the file here.

0,50 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
0,45 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
0,40 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
0,35 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
0,30 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
0,25 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
0,20 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
0,15 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
0,10 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
0,05 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
0,00 | ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
------------------------------------------------------------------
0% 16% 33% 50% 66% 83%
