Binning in Data Mining

Data binning, also called bucketing, is a data pre-processing method used to reduce the effects of small observation errors. The original data values are divided into small intervals known as bins, and each value is then replaced by a representative value calculated for its bin (for example, the bin mean or median). This has a smoothing effect on the input data and may also reduce the chance of overfitting on small datasets.
There are two common methods of dividing data into bins:
- Equal Frequency Binning: each bin holds the same number of values.
- Equal Width Binning: each bin spans the same width w = (max - min) / (number of bins), so for n bins the boundaries are min, min + w, min + 2w, ..., min + nw = max.
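As a rough illustration of the equal-width rule, the boundaries can be computed directly from the minimum, the maximum, and the number of bins. This is only a minimal sketch: the helper name `bin_edges` is hypothetical, and it assumes a non-empty list of numbers.

```python
def bin_edges(values, n_bins):
    """Return the n_bins + 1 boundaries of equal-width bins over values."""
    lo, hi = min(values), max(values)
    w = (hi - lo) / n_bins  # width of every bin
    # Boundaries run from lo to hi in steps of w.
    return [lo + i * w for i in range(n_bins + 1)]


data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
print(bin_edges(data, 3))  # [5.0, 75.0, 145.0, 215.0]
```

With three bins over this data, w = (215 - 5) / 3 = 70, so the bins cover 5 to 75, 75 to 145, and 145 to 215.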
Equal frequency:
Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
Output: [5, 10, 11, 13] [15, 35, 50, 55] [72, 92, 204, 215]

Equal Width:
Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
Output: [5, 10, 11, 13, 15, 35, 50, 55, 72] [92] [204, 215]
Code: Implementation of Binning Technique:
Python

    # equal frequency: every bin gets the same number of values
    def equifreq(arr1, m):
        arr1 = sorted(arr1)  # splitting by position requires sorted values
        a = len(arr1)
        n = a // m  # values per bin
        for i in range(m):
            # the last bin also takes any leftover values when a % m != 0
            end = (i + 1) * n if i < m - 1 else a
            print(arr1[i * n:end])


    # equal width: every bin spans the same range of values
    def equiwidth(arr1, m):
        min1 = min(arr1)
        # true division avoids truncating the width; assumes max(arr1) > min1
        w = (max(arr1) - min1) / m
        arri = [[] for _ in range(m)]
        for j in arr1:
            # each value lands in exactly one bin; the maximum goes in the last
            i = min(int((j - min1) / w), m - 1)
            arri[i].append(j)
        print(arri)


    # data to be binned
    data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

    # no of bins
    m = 3

    print("equal frequency binning")
    equifreq(data, m)

    print("\n\nequal width binning")
    equiwidth(data, m)
Output:

    equal frequency binning
    [5, 10, 11, 13]
    [15, 35, 50, 55]
    [72, 92, 204, 215]


    equal width binning
    [[5, 10, 11, 13, 15, 35, 50, 55, 72], [92], [204, 215]]
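The introduction mentions that, once the bins exist, each value is replaced by a general value calculated for its bin. One possible choice for that value is the bin mean. The sketch below illustrates the idea with equal-frequency bins. The function name `smooth_by_bin_means` is hypothetical, the choice of the mean is an assumption rather than something the article prescribes, and the result is returned in sorted order rather than the original order.

```python
from statistics import mean


def smooth_by_bin_means(values, n_bins):
    """Replace each value with the mean of its equal-frequency bin."""
    ordered = sorted(values)
    size = len(ordered) // n_bins  # values per bin
    smoothed = []
    for i in range(n_bins):
        # The last bin also absorbs any leftover values.
        end = (i + 1) * size if i < n_bins - 1 else len(ordered)
        chunk = ordered[i * size:end]
        if not chunk:  # skip empty bins (fewer values than bins)
            continue
        smoothed.extend([mean(chunk)] * len(chunk))
    return smoothed


data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
print(smooth_by_bin_means(data, 3))
```

For the sample data, the three bins have means 9.75, 38.75, and 145.75, so each group of four values collapses to one of those numbers. Using the median instead of the mean would make the result less sensitive to extreme values such as 204 and 215.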



