I recently came across an interesting problem while working on my random variable implementation for BIP. Any type in Python is expected to have a `__str__(self)` method which returns an adequate and expressive string representation of the object. As far as I could tell, the most straightforward representation of a random variable is its probability distribution. Probability distributions are most often depicted graphically, by a continuous density function or a histogram. So my challenge was this: how to condense the information conveyed by a histogram into a concise ASCII string, suitable to be the output of a print statement?

I immediately rejected the boring solution of representing the distribution by its moments (mean, variance, skewness, etc.). I wanted a full histogram in as few ASCII characters as possible, so I set out to implement my own ASCII histogram generator. It turned out to be a very simple task, given the handy **histogram** function in NumPy and how easy it is to do string formatting in Python. It was nevertheless a fun couple of hours of programming. I ended up implementing a horizontal and a vertical histogram. The ASCII histogram proved to be very useful, since it helped enormously in debugging code involving probability calculations with simple print statements. Probabilistic simulations are extremely hard to test because the results of a given operation are never strictly the same. However, they should have the same probability distribution, so by looking at the rough shape of the histogram you can tell whether your calculations are going in the right direction.
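The point about testing can be illustrated with a small sketch. It uses only the standard library rather than NumPy, and the helper names (`coarse_counts`, `draw`), the bin edges and the tolerance of 400 are arbitrary choices made for this illustration, not part of BIP. Two runs with different seeds produce different samples, yet their coarse bin counts stay close:

```python
import random


def coarse_counts(samples, edges):
    """Count how many samples fall into each interval delimited by edges."""
    counts = [0] * (len(edges) + 1)
    for x in samples:
        # The bin index is the number of edges that x has reached or passed.
        idx = sum(1 for e in edges if x >= e)
        counts[idx] += 1
    return counts


def draw(seed, n=10000):
    """Draw n standard normal samples from a seeded generator."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


run_a = draw(seed=1)
run_b = draw(seed=2)

edges = [-1.0, 0.0, 1.0]  # four coarse bins
counts_a = coarse_counts(run_a, edges)
counts_b = coarse_counts(run_b, edges)

# The raw samples differ, but the coarse shape should agree within a
# generous (assumed) tolerance.
same_shape = all(abs(a - b) < 400 for a, b in zip(counts_a, counts_b))
print(counts_a, counts_b, same_shape)
```

Eyeballing a printed histogram is the informal version of the same comparison.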

Curiously, such a simple and expressive representation for probability distributions is not available in any package I know of, so I decided to share the code with the scientific Python community so that people may put it to good use. The code below is part of BIP and consequently under the GPL license. Any suggestions for improvements are welcome.

    # -*- coding: utf-8 -*-
    from math import ceil

    from numpy import histogram


    class Histogram(object):
        """
        Ascii histogram
        """
        def __init__(self, data, bins=10):
            """
            Class constructor

            :Parameters:
                - `data`: array like object
            """
            self.data = data
            self.bins = bins
            # self.h[0] holds the bin counts, self.h[1] the bin edges
            self.h = histogram(self.data, bins=self.bins)

        def horizontal(self, height=4, character='|'):
            """Returns a multiline string containing
            a horizontal histogram representation of self.data

            :Parameters:
                - `height`: Height of the histogram in characters
                - `character`: Character to use

            >>> d = normal(size=1000)
            >>> h = Histogram(d,bins=25)
            >>> print(h.horizontal(5,'|'))
            106            |||
                          |||||
                         |||||||
                        ||||||||||
                      |||||||||||||
            -3.42                         3.09
            """
            his = ""
            # float() avoids integer division when the counts are integers
            bars = self.h[0] / float(max(self.h[0])) * height
            for l in reversed(range(1, height + 1)):
                line = ""
                if l == height:
                    line = '%s ' % max(self.h[0])  # histogram top count
                else:
                    line = ' ' * (len(str(max(self.h[0]))) + 1)  # add leading spaces
                for c in bars:
                    if c >= ceil(l):
                        line += character
                    else:
                        line += ' '
                line += '\n'
                his += line
            his += '%.2f' % self.h[1][0] + ' ' * (self.bins) + '%.2f' % self.h[1][-1] + '\n'
            return his

        def vertical(self, height=20, character='|'):
            """
            Returns a multi-line string containing
            a vertical histogram representation of self.data

            :Parameters:
                - `height`: Height of the histogram in characters
                - `character`: Character to use

            >>> d = normal(size=1000)
            >>> h = Histogram(d,bins=10)
            >>> print(h.vertical(15,'*'))
                                  236
            -3.42:
            -2.78:
            -2.14: ***
            -1.51: *********
            -0.87: *************
            -0.23: ***************
            0.41 : ***********
            1.04 : ********
            1.68 : *
            2.32 :
            """
            his = ""
            xl = ['%.2f' % n for n in self.h[1]]  # bin edge labels
            lxl = [len(l) for l in xl]
            bars = self.h[0] / float(max(self.h[0])) * height
            # place the top count above the end of the longest bar
            his += ' ' * int(max(bars) + 2 + max(lxl)) + '%s\n' % max(self.h[0])
            for i, c in enumerate(bars):
                # int() is needed because string repetition rejects floats
                line = xl[i] + ' ' * (max(lxl) - lxl[i]) + ': ' + character * int(c) + '\n'
                his += line
            return his


    if __name__ == "__main__":
        from numpy.random import normal
        d = normal(size=1000)
        h = Histogram(d, bins=10)
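To tie this back to the original motivation, here is a minimal sketch of how a random variable type might delegate `__str__` to an ASCII histogram. It is deliberately self-contained and uses only the standard library, so `ascii_hist` is a simplified stand-in for the class above, and `RandomVariable` is a hypothetical container invented for this example, not the BIP class.

```python
import random


def ascii_hist(samples, bins=10, width=20, character='*'):
    """Return a multi-line string with one row of characters per bin."""
    lo, hi = min(samples), max(samples)
    # Fall back to a unit step when all samples are equal.
    step = (hi - lo) / float(bins) or 1.0
    counts = [0] * bins
    for x in samples:
        # Clamp the index so the maximum lands in the last bin.
        idx = min(int((x - lo) / step), bins - 1)
        counts[idx] += 1
    top = max(counts)
    rows = []
    for i, c in enumerate(counts):
        bar = character * int(c * width / float(top))
        rows.append('%8.2f: %s' % (lo + i * step, bar))
    return '\n'.join(rows)


class RandomVariable(object):
    """Hypothetical sample container, used only for illustration."""

    def __init__(self, samples):
        self.samples = list(samples)

    def __str__(self):
        # print(rv) now shows the shape of the distribution.
        return ascii_hist(self.samples)


rng = random.Random(0)
rv = RandomVariable(rng.gauss(0.0, 1.0) for _ in range(1000))
print(rv)
```

With this in place, a plain `print(rv)` in the middle of a debugging session shows the distribution instead of an opaque object address.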

## 8 comments:

This is pretty neat. Why did you choose GPL though? Why not let proprietary users use it?

You might want to check

http://en.wikipedia.org/wiki/Stemplot

It is implemented in R. For example, using 100 Normal random variables you get:

    > stem(rnorm(100))

      The decimal point is at the |

      -2 | 31
      -1 | 9877
      -1 | 443332110000
      -0 | 9988877766655
      -0 | 4433333221100
       0 | 000111111111222222333444
       0 | 55556666777799
       1 | 001122233334
       1 | 555
       2 | 023

@jsalvati: because proprietary developers don't give back to the community.

@Adrian: I was referring to Python, not R. But I still think my histogram looks nicer than R's stem plot.

Looks great. Maybe you should mention it on the scipy/numpy forums; I am sure they would like to integrate it.

Thank you so much for posting this.

I had to make a few changes so it would work for me:

* import numpy

* import math

* use float() to get floating-point division

here's the result of a diff with --unified=2

    --- histogram.py 2010/10/19 17:03:11 1.1
    +++ histogram.py 2010/10/19 17:42:17 1.2
    @@ -9,4 +9,7 @@
     #
     #
    +import numpy as np
    +import math
    +
     class Histogram(object):
         """
    @@ -18,9 +21,10 @@
             :Parameters:
    -            - `data`: array like object
    +            - `data`: array-like object
    +            - `bins`: number of bins (default 10)
             """
             self.data = data
             self.bins = bins
    -        self.h = histogram(self.data, bins=self.bins)
    +        self.h = np.histogram(self.data, bins=self.bins)

         def horizontal(self, height=4, character ='|'):
             """Returns a multiline string containing a
    @@ -40,5 +44,5 @@
             """
             his = """"""
    -        bars = self.h[0]/max(self.h[0])*height
    +        bars = self.h[0]/float(max(self.h[0]))*height
             for l in reversed(range(1,height+1)):
                 line = ""
    @@ -48,5 +52,5 @@
                     line = ' '*(len(str(max(self.h[0])))+1) #add leading spaces
                 for c in bars:
    -                if c >= ceil(l):
    +                if c >= math.ceil(l):
                         line += character
                     else:
    @@ -81,5 +85,5 @@
             xl = ['%.2f'%n for n in self.h[1]]
             lxl = [len(l) for l in xl]
    -        bars = self.h[0]/max(self.h[0])*height
    +        bars = self.h[0]/float(max(self.h[0]))*height
             his += ' '*(max(bars)+2+max(lxl))+'%s\n'%max(self.h[0])
             for i,c in enumerate(bars):

Is there really no way to format a comment with a fixed-width font?

Could you be specific about which version or versions of the GPL you intend? Thanks!

I like your post; it definitely beats looking at a histogram as a list of numbers!

I ran into a compatibility problem with older versions of Python (2.6 for me, but it should manifest in anything without future division). The line

bars = self.h[0]/max(self.h[0])*height

could be changed to

bars = self.h[0]*height/max(self.h[0])

(i.e. do the multiplication before the division) in order to not lose precision from integer division.
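A small worked example of the effect described in this comment. The bin counts are made up for illustration, and `//` is used to reproduce the integer `/` of Python 2 on any Python version:

```python
counts = [3, 40, 106, 52, 7]  # made-up bin counts
height = 4
top = max(counts)

# Divide first: every bin below the maximum truncates to 0 before scaling.
divide_first = [c // top * height for c in counts]

# Multiply first: truncation happens only once, after scaling.
multiply_first = [c * height // top for c in counts]

print(divide_first)    # [0, 0, 4, 0, 0]
print(multiply_first)  # [0, 1, 4, 1, 0]
```

Note that multiplying first still yields whole-number bar heights, whereas the `float()` variant from the earlier comment keeps the fractional part.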
