A data-driven solution for text grouping.
Time is the most valuable thing a person can spend, so it is better spent the right way, on the most valuable things. Spending it coding things the old-fashioned way is clearly not one of my hobbies.
Here, I share a solution to the following problem:
“Given a bunch of bounding boxes delimiting the words of a text, we want to identify the fewest possible blocks of text.”
# Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
import cv2
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Importing target img & bounding boxes
img = cv2.imread(IMG_PATH)
boxes = pd.read_csv(BB_PATH, names=["x", "y", "X", "Y"])
I'll experiment on this dummy website; the detection state is shown below.
plt.figure(figsize=(15, 15))
# cv2 loads images as BGR; convert to RGB so matplotlib displays the colors correctly
plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
The bounding boxes are detected using the EAST neural network. Here I am only using 4 features:
- x : x coordinate of the bounding box upper left corner
- y : y coordinate of the bounding box upper left corner
- X : x coordinate of the bounding box lower right corner
- Y : y coordinate of the bounding box lower right corner
boxes.head()
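Since Rectangle is already imported, here is a minimal sketch (my own addition, not part of the original pipeline) of how these four features map onto the image, drawing each detected box on top of it:
# Sketch: overlay every detected bounding box on the image
plt.figure(figsize=(15, 15))
ax = plt.gca()
ax.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
for _, b in boxes.iterrows():
    # (x, y) is the upper left corner; (X, Y) is the lower right corner
    ax.add_patch(Rectangle((b.x, b.y), b.X - b.x, b.Y - b.y,
                           fill=False, edgecolor="red", linewidth=1))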
plt.figure(figsize=(15, 10))
plt.subplot(1, 2, 1)
plt.scatter(boxes.X, boxes.Y)
plt.xlabel("X")
plt.ylabel("Y")
plt.subplot(1, 2, 2)
plt.scatter(boxes.x, boxes.y)
plt.xlabel("x")
plt.ylabel("y");
Since paragraphs represent regions of high density, while other detected words or short sentences lie in regions of low density, it is easy to separate them. A well-suited algorithm for this purpose is DBSCAN: it is good at separating target data (high density; paragraphs) from noise data (low density; small groups of text).
In my case the number of paragraphs can vary from one input image to another, so that's a second reason to use a clustering algorithm that doesn't require specifying the number of clusters: DBSCAN. However, I should admit that the prediction depends on a hyperparameter, epsilon, which is the main parameter to set. But here, that is not a big deal. In the following section, I'll show how to get a suitable epsilon for a given bounding-box dataset.
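As a quick toy illustration of that behaviour (made-up points, not the project data): DBSCAN labels every point it considers noise with -1, and the parameters to choose are eps and min_samples.
# Toy data: two dense groups plus one isolated point
pts = np.array([[0, 0], [0, 1], [1, 0],        # dense group -> cluster 0
                [10, 10], [10, 11], [11, 10],  # dense group -> cluster 1
                [50, 50]])                     # isolated point -> noise
labels = DBSCAN(eps=2, min_samples=3).fit(pts).labels_
print(labels)  # [0 0 0 1 1 1 -1] -- noise gets the label -1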
Our data has 4 features, but DBSCAN doesn't deal well with high-dimensional data, so I'll fit it only on (x, y) or (X, Y).
# Extracting features
data = boxes.iloc[:, 0:2]

# Getting epsilon
# I am considering only regions with 4 words
NN = NearestNeighbors(n_neighbors=3, metric="euclidean")
distances, indices = NN.fit(data).kneighbors(data)
distances = np.sort(distances, axis=0)[:, 1]
# A good starting value for epsilon would be the distance right before the exploding gradient
gradient = np.gradient(distances)
plt.plot(distances)
plt.plot(gradient, label="gradient")
plt.xlabel("Box")
plt.ylabel("Distance")
plt.legend();
Clearly a good value of epsilon to start with is 100.
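Reading the knee off the plot works, but it can also be approximated programmatically. Here is a rough heuristic sketch (my own addition), reusing the distances and gradient arrays computed above:
# Pick the k-distance just before the gradient explodes
knee = int(np.argmax(gradient))
eps_guess = distances[max(knee - 1, 0)]
print(f"Suggested starting eps: {eps_guess:.1f}")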
# Fitting DBSCAN on data
clustering = DBSCAN(eps=100, min_samples=3)
y = clustering.fit(data).labels_

# Plotting result
plt.scatter(boxes.x, boxes.y, c=y)
plt.xlabel("x")
plt.ylabel("y");
Purple box corners are considered noise by DBSCAN, which in my case means sentences of 2 to 3 words.
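Since the stated goal is the fewest possible blocks of text, each cluster can finally be merged into a single paragraph-level box. A minimal sketch, dropping the noise points and taking each cluster's extreme coordinates:
# One bounding box per paragraph: min of upper left, max of lower right corners
boxes["label"] = y
paragraphs = (boxes[boxes.label != -1]   # drop DBSCAN noise (-1)
              .groupby("label")
              .agg({"x": "min", "y": "min", "X": "max", "Y": "max"}))
print(paragraphs)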
Result:
The results are quite satisfying for the purpose of my project, which goes beyond text detection.
# Input
INPUT = cv2.imread("C:\\Users\\Otmane\\Desktop\\IN.png")

# Output
OUT = cv2.imread("C:\\Users\\Otmane\\Desktop\\OUT.png")

# cv2 loads BGR; convert to RGB for display
plt.figure(figsize=(20, 25))
plt.subplot(1, 2, 1)
plt.imshow(cv2.cvtColor(INPUT, cv2.COLOR_BGR2RGB))
plt.subplot(1, 2, 2)
plt.imshow(cv2.cvtColor(OUT, cv2.COLOR_BGR2RGB))