Reference Info for C44


CATALOGBANK: A STRUCTURED AND INTEROPERABLE CATALOG DATASET WITH A SEMI-AUTOMATIC ANNOTATION TOOL (DOCUMENTLABELER) FOR ENGINEERING SYSTEM DESIGN

S. Bank, D. R. Herber


[doi] [pdf] [arXiv:2408.08238] [CatalogBank] [DocumentLabeler] [slides] [recording]

Nominated for Best Paper Award

Text Reference:

S. Bank, D. R. Herber. 'CatalogBank: a structured and interoperable catalog dataset with a semi-automatic annotation tool (DocumentLabeler) for engineering system design.' In ACM 2024 Symposium on Document Engineering (DocEng), San Jose, CA, USA, Aug 2024. doi: 10.1145/3685650.3685665

BibTeX Source:

@inproceedings{Bank2024a,
  author    = {Bank, Sinan and Herber, Daniel R},
  title     = {{CatalogBank}: a structured and interoperable catalog dataset with a semi-automatic annotation tool {(DocumentLabeler)} for engineering system design},
  booktitle = {ACM 2024 Symposium on Document Engineering (DocEng)},
  address   = {San Jose, CA, USA},
  month     = aug,
  year      = {2024},
  doi       = {10.1145/3685650.3685665},
  pdf       = {https://arxiv.org/pdf/2408.08238.pdf},
}

Abstract:

In the realm of document engineering and Natural Language Processing (NLP), the integration of digitally born catalogs into product design processes presents a novel avenue for enhancing information extraction and interoperability. This paper introduces CatalogBank, a dataset developed to bridge the gap between textual descriptions and other data modalities related to engineering design catalogs. We utilized existing information extraction methodologies to extract product information from PDF-based catalogs to use in downstream tasks to generate a baseline metric. Our approach not only supports the potential automation of design workflows but also overcomes the limitations of manual data entry and non-standard metadata structures that have historically impeded the seamless integration of textual and other data modalities. Through the use of DocumentLabeler, an open-source annotation tool adapted for our dataset, we demonstrated the potential of CatalogBank in supporting diverse document-based tasks such as layout analysis and knowledge extraction. Our findings suggest that CatalogBank can contribute to document engineering and NLP by providing a robust dataset for training models capable of understanding and processing complex document formats with relatively less effort using the semi-automated annotation tool DocumentLabeler.