---
library_name: transformers
pipeline_tag: image-feature-extraction
---
## TextNet-T/S/B: Efficient Text Detection Models


### **Overview**
[TextNet](https://arxiv.org/abs/2111.02394) is a lightweight and efficient architecture designed specifically for text detection, offering superior performance compared to traditional models like MobileNetV3. With variants **TextNet-T**, **TextNet-S**, and **TextNet-B** (6.8M, 8.0M, and 8.9M parameters respectively), it achieves an excellent balance between accuracy and inference speed.


### **Performance**
TextNet achieves state-of-the-art results in text detection, outperforming hand-crafted models in both accuracy and speed. Its architecture is highly efficient, making it ideal for GPU-based applications.

### **How to use**
Install the Transformers library:
```bash
pip install transformers
```

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoBackbone

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Use the same checkpoint for the image processor and the model
processor = AutoImageProcessor.from_pretrained("jadechoghari/textnet-tiny")
model = AutoBackbone.from_pretrained("jadechoghari/textnet-tiny")

inputs = processor(image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
```
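The backbone returns a `BackboneOutput` whose `feature_maps` field holds one tensor per output stage. A minimal self-contained sketch of inspecting those shapes, reusing the `textnet-tiny` checkpoint from the example above and feeding a dummy batch in place of a real preprocessed image (the stage names are assumed to come from the backbone config's `out_features`):

```python
import torch
from transformers import AutoBackbone

model = AutoBackbone.from_pretrained("jadechoghari/textnet-tiny")

# A dummy batch stands in for a real preprocessed image.
pixel_values = torch.randn(1, 3, 640, 640)
with torch.no_grad():
    outputs = model(pixel_values=pixel_values)

# BackboneOutput.feature_maps holds one (batch, channels, height, width)
# tensor per output stage.
for stage_name, feature_map in zip(model.config.out_features, outputs.feature_maps):
    print(stage_name, tuple(feature_map.shape))
```

The spatial resolution of each feature map shrinks stage by stage, which is what downstream text-detection heads consume.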
### **Training**
TextNet is first compared with representative hand-crafted backbones such as ResNets and VGG16. For a fair comparison, all models are pre-trained on IC17-MLT [52] and then fine-tuned on Total-Text. The TextNet models achieve a significantly better trade-off between accuracy and inference speed than previous hand-crafted models. Notably, TextNet-T, -S, and -B have only 6.8M, 8.0M, and 8.9M parameters respectively, making them more parameter-efficient than ResNets and VGG16. These results demonstrate that TextNet models are effective for text detection on GPU devices.
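The parameter counts quoted above can be sanity-checked directly from a loaded checkpoint. A minimal sketch, assuming the `textnet-tiny` checkpoint used in the usage example (the count for the bare backbone may differ slightly from the paper's figure, which may include detection heads):

```python
from transformers import AutoBackbone

# Load the tiny variant and count its parameters.
model = AutoBackbone.from_pretrained("jadechoghari/textnet-tiny")
num_params = sum(p.numel() for p in model.parameters())
print(f"TextNet-T backbone: {num_params / 1e6:.1f}M parameters")
```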


### **Applications**
TextNet is well suited to real-world text detection tasks, including:
- Natural scene text recognition
- Multi-lingual and multi-oriented text detection
- Document text region analysis

### **Contribution**
This model was contributed by [Raghavan](https://huggingface.co/Raghavan),
[jadechoghari](https://huggingface.co/jadechoghari)
and [nielsr](https://huggingface.co/nielsr).