
Parsing CAD, BIM, and PDF Floorplans: How Platforms Extract Spatial Data

Floorplan parsing is the computational process of reading architectural drawing files and extracting structured, machine-readable spatial data from them. A raw CAD file or PDF contains lines, arcs, text strings, and hatch patterns, but it does not explicitly declare "this polyline is a wall" or "this region is a room." Parsing bridges that gap: it interprets low-level graphical primitives and produces a semantic model of the building, comprising walls with thicknesses and materials, doors with swing directions, rooms with names and areas, and text annotations with positions and content. This semantic model is the foundation for every downstream application, from generative floorplan design and canvas-based editing to digital-twin construction. This article examines each stage of the parsing pipeline, the file formats involved, the technical challenges, and the data models that result.


Table of Contents

  1. Why Floorplan Parsing Matters
  2. Floorplan File Formats
  3. Coordinate Systems and Scale
  4. Layer Structures in CAD Files
  5. Entity Extraction
  6. Scale Detection and Calibration
  7. OCR for Legacy Plans
  8. Georeferencing
  9. Challenges in Floorplan Parsing
  10. The Parsing Pipeline
  11. Data Models for Spatial Entities
  12. Key Takeaways
  13. Frequently Asked Questions
  14. Next Steps

Why Floorplan Parsing Matters

Every spatial intelligence workflow, whether it involves placing signs, optimising furniture layouts, tracking assets, or monitoring occupancy, depends on an accurate digital representation of the physical space. Without parsing, that representation does not exist. A PDF of a floorplan is visually useful to a human reader but opaque to software. A DXF file contains geometric primitives but not spatial semantics. Parsing converts these raw inputs into a structured data model that algorithms can reason about: querying rooms by name, measuring distances between points, identifying corridors, and detecting decision points for wayfinding.

The quality of the parsed data determines the quality of everything built on top of it. Inaccurate wall detection produces incorrect room boundaries. Missing doors create false dead ends in circulation analysis. Wrong scale factors produce layouts where a 3-metre corridor appears as 30 metres. For these reasons, robust floorplan parsing is a critical capability of any spatial platform.

Floorplan parsing, sometimes called architectural drawing extraction or spatial data extraction, is the process by which software interprets the graphical content of architectural drawings and transforms it into a structured, queryable data model containing walls, doors, rooms, windows, and annotations with their geometric properties and spatial relationships.


Floorplan File Formats

DXF (Drawing Exchange Format)

DXF is an ASCII or binary file format created by Autodesk to enable interoperability between CAD programs. It stores entities (lines, polylines, arcs, circles, text, dimensions, blocks, and hatches) organised into layers. DXF is the most common interchange format for 2D architectural drawings and is supported by virtually every CAD application.

From a parsing perspective, DXF is relatively tractable because its structure is documented and open. Each entity has a type code, coordinates, layer assignment, and optional attributes. The primary entities encountered in architectural DXF files include:

  • LINE: a single straight segment defined by start and end points.
  • LWPOLYLINE / POLYLINE: a connected sequence of line and arc segments, commonly used for wall outlines, room boundaries, and complex shapes.
  • ARC and CIRCLE: curved entities used for door swings, column outlines, and rounded walls.
  • INSERT: a block reference that places a pre-defined group of entities (a "block") at a specific location with scale and rotation. Doors, windows, furniture, and equipment are typically represented as block insertions.
  • TEXT and MTEXT: single-line and multi-line text entities carrying room names, dimensions, annotations, and notes.
  • HATCH: filled regions indicating materials, floor finishes, or solid construction.
  • DIMENSION: measurement annotations linking two points with a numerical value.

The challenge lies in interpreting the semantic meaning of entities: a polyline on the "A-WALL" layer is probably a wall, but a polyline on an unnamed layer could be anything.
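To make the entity structure concrete, the following is a minimal sketch of reading LINE entities from an ASCII DXF file using only the Python standard library. ASCII DXF is a sequence of (group code, value) line pairs; group code 0 starts a new entity, 8 carries the layer name, and 10/20 and 11/21 carry the start and end coordinates of a LINE. The function names and the inline sample are illustrative assumptions, and a production parser would normally rely on an established DXF library instead.

```python
from typing import Iterator

# A tiny hand-written ASCII DXF fragment (illustrative sample, not a real export).
SAMPLE_DXF = """\
0
SECTION
2
ENTITIES
0
LINE
8
A-WALL
10
0.0
20
0.0
11
5000.0
21
0.0
0
LINE
8
0
10
0.0
20
0.0
11
0.0
21
3000.0
0
ENDSEC
0
EOF
"""

# Group codes for the start and end point of a LINE entity.
COORD_KEYS = {10: "x1", 20: "y1", 11: "x2", 21: "y2"}


def read_pairs(text: str) -> Iterator[tuple[int, str]]:
    """Yield (group code, value) pairs from ASCII DXF text."""
    lines = text.splitlines()
    for i in range(0, len(lines) - 1, 2):
        yield int(lines[i].strip()), lines[i + 1].strip()


def extract_lines(text: str) -> list[dict]:
    """Collect LINE entities (layer plus endpoints) from the ENTITIES section."""
    found: list[dict] = []
    current = None
    in_entities = False
    last_marker = ""
    for code, value in read_pairs(text):
        if code == 0:  # a new entity or section marker begins
            if current is not None:
                found.append(current)
            current = {"layer": "0"} if (in_entities and value == "LINE") else None
            if value == "ENDSEC":
                in_entities = False
            last_marker = value
        elif code == 2 and last_marker == "SECTION":
            in_entities = value == "ENTITIES"
        elif current is not None:
            if code == 8:
                current["layer"] = value
            elif code in COORD_KEYS:
                current[COORD_KEYS[code]] = float(value)
    if current is not None:  # flush an entity left open at end of input
        found.append(current)
    return found


for entity in extract_lines(SAMPLE_DXF):
    print(entity)
```

The sketch deliberately ignores every entity type other than LINE, as well as blocks, binary DXF, and text encodings. It is intended only to show that extraction is mechanical while classification (wall versus anything else) is not.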

DWG (Drawing)

DWG is Autodesk's proprietary binary format, used natively by AutoCAD. It contains the same entity types as DXF but in a more compact binary encoding for which Autodesk publishes no specification. Parsing DWG files requires either a dedicated library (such as Autodesk's RealDWG or the Open Design Alliance's Drawings SDK, formerly known as Teigha) or conversion to DXF first. Because DWG is the working format for most architecture practices, reliable DWG ingestion is essential for any parsing platform.

DWG files frequently contain external references (XREFs) that link to other DWG files, such as structural grids, mechanical layouts, or site plans. A robust parser must resolve these references and merge the relevant geometry into a unified model, or at minimum identify and skip non-architectural XREF layers.

IFC (Industry Foundation Classes)

IFC is the open standard for BIM (Building Information Modelling) data exchange, maintained by buildingSMART International. Unlike DXF and DWG, IFC is semantically rich: it explicitly defines walls (IfcWall), doors (IfcDoor), spaces (IfcSpace), windows (IfcWindow), slabs (IfcSlab), and their relationships through IfcRelSpaceBoundary and IfcRelConnectsElements entities. Parsing IFC is therefore easier in principle because the semantic classification is already done.

The challenge lies in the complexity of the IFC schema (several hundred entity types in IFC4) and the inconsistency of real-world IFC exports. Common issues include:

  • Missing or incorrect IfcSpace boundaries, requiring geometric reconstruction from wall geometry.
  • Inconsistent use of IfcBuildingStorey for floor-level organisation.
  • Over-detailed models that include structural, mechanical, and electrical elements alongside architectural geometry, requiring filtering.
  • Varying geometric representation types (boundary representation, swept solids, clipping) that must be tessellated to 2D for floorplan display.

PDF (Portable Document Format)

PDF is the most common delivery format for architectural drawings. Contractors, facilities managers, and building owners typically receive PDFs rather than native CAD files. PDF floorplans come in two varieties:

  • Vector PDFs: contain drawing instructions (lines, curves, fills, text rendering commands) that can be extracted programmatically using PDF content stream parsing. The geometry is precise, but semantic information (layer names, block definitions) is lost during the export from CAD. Coordinate extraction from vector PDFs involves reading the content stream operators (m, l, c for move, line, curve) and transforming them through the current transformation matrix (CTM) to obtain page-space coordinates.
  • Raster PDFs: contain scanned images of printed drawings. Extracting geometry from raster PDFs requires optical character recognition (OCR) for text and computer-vision techniques (edge detection, Hough transforms) for lines and shapes. Accuracy is lower and processing is slower.
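The coordinate extraction described for vector PDFs can be sketched as a toy interpreter. In PDF, a transformation matrix is written as six numbers [a b c d e f], a point maps to (a·x + c·y + e, b·x + d·y + f), and the `cm` operator premultiplies the current matrix. The tokeniser below is a deliberate simplification (whitespace-separated tokens, only `q`, `Q`, `cm`, `m`, and `l` handled) and the function names are assumptions for illustration; real content streams also contain names, strings, arrays, and curves.

```python
Matrix = tuple[float, float, float, float, float, float]
IDENTITY: Matrix = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)


def apply_ctm(ctm: Matrix, x: float, y: float) -> tuple[float, float]:
    """Map a point through the matrix [a b c d e f]."""
    a, b, c, d, e, f = ctm
    return (a * x + c * y + e, b * x + d * y + f)


def concat(m: Matrix, ctm: Matrix) -> Matrix:
    """Return m x ctm (row-vector convention), as the `cm` operator does."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = ctm
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def extract_segments(stream: str) -> list[tuple[tuple[float, float], tuple[float, float]]]:
    """Collect straight segments in page space from a simplified content stream."""
    ctm = IDENTITY
    saved: list[Matrix] = []      # graphics-state stack for q / Q
    operands: list[float] = []
    current = None                # current point of the path being built
    segments = []
    for token in stream.split():
        try:
            operands.append(float(token))
            continue
        except ValueError:
            pass                  # not a number, so treat it as an operator
        if token == "q":
            saved.append(ctm)
        elif token == "Q":
            ctm = saved.pop()
        elif token == "cm":
            ctm = concat(tuple(operands[-6:]), ctm)
        elif token == "m":
            current = apply_ctm(ctm, *operands[-2:])
        elif token == "l":
            point = apply_ctm(ctm, *operands[-2:])
            segments.append((current, point))
            current = point
        operands.clear()
    return segments


print(extract_segments("q 0.5 0 0 0.5 100 200 cm 0 0 m 10 20 l S Q"))
```

Here a line drawn from (0, 0) to (10, 20) under a matrix that halves the scale and shifts by (100, 200) lands at (100, 200) to (105, 210) in page space.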

SVG (Scalable Vector Graphics)

SVG is an XML-based vector format used primarily for web display. Some platforms export floorplans as SVG for browser rendering. SVG preserves geometry and supports metadata through custom attributes, but it is not a standard architectural exchange format and lacks the layer structure of DXF.


Coordinate Systems and Scale

World Coordinates vs Page Coordinates

CAD files store geometry in "model space" (world coordinates), which may use metres, millimetres, feet, or inches. The coordinate system is typically Cartesian with an arbitrary origin set by the drafter. PDF files, by contrast, store geometry in "page space" (points, where 1 point = 1/72 inch), with no inherent connection to real-world units. A parsing pipeline must determine the correct coordinate system and unit for each file and transform all geometry into a consistent output coordinate system.

The distinction between world coordinates and page coordinates is fundamental to floorplan parsing. World coordinates represent real-world measurements and positions. Page coordinates represent positions on a printed or displayed page. When a CAD file is exported to PDF, the world-to-page transformation is applied, and the information needed to reverse it (the scale factor, the origin offset) is typically lost. The parser must reconstruct this mapping.

Coordinate Origin

The origin (0, 0) in a CAD file is determined by the drafter. Some place it at the building's lower-left corner; others place it at a survey datum hundreds of metres away. Parsing pipelines normalise coordinates so that the floorplan's bounding box starts at or near the origin, ensuring consistent downstream processing.

Rotation and Mirroring

Drawings are sometimes rotated so that north faces up, or mirrored to show a reflected ceiling plan. The parser must detect and correct these transformations to produce a consistently oriented output. Detection heuristics include checking text orientation (text should read left-to-right in the output), comparing compass annotations, and analysing room label positions.

Unit Detection

DXF files include a header variable ($INSUNITS) that specifies the drawing units, but this variable is frequently set incorrectly. Robust parsers cross-check $INSUNITS against known architectural dimensions: if a door opening measures 0.9 when units are set to metres, the value is plausible; if it measures 900, the true units are probably millimetres. This heuristic approach to unit detection is essential for handling real-world files where metadata cannot be trusted.
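The door-width cross-check can be sketched as follows. The plausible door-width range and the list of candidate units are assumptions chosen for illustration; the conversion factors themselves are exact definitions. The function returns every unit that makes the measured widths plausible, so that ambiguity is reported and not hidden.

```python
from statistics import median

# Metres per drawing unit for each candidate unit.
CANDIDATE_UNITS = {
    "millimetres": 0.001,
    "centimetres": 0.01,
    "metres": 1.0,
    "inches": 0.0254,
    "feet": 0.3048,
}

# Assumed plausible width of a single door opening, in metres.
PLAUSIBLE_DOOR_WIDTH_M = (0.6, 1.3)


def guess_units(door_widths: list[float]) -> list[str]:
    """Return the candidate units under which the median door width is plausible."""
    typical = median(door_widths)
    low, high = PLAUSIBLE_DOOR_WIDTH_M
    return [
        name
        for name, metres_per_unit in CANDIDATE_UNITS.items()
        if low <= typical * metres_per_unit <= high
    ]


def units_are_consistent(declared: str, door_widths: list[float]) -> bool:
    """Check a declared unit (for example, one decoded from $INSUNITS) against geometry."""
    return declared in guess_units(door_widths)


print(guess_units([900, 910, 820]))                         # widths measured in the drawing
print(units_are_consistent("metres", [900, 910, 820]))
```

When `units_are_consistent` returns False, a pipeline would typically prefer the geometric evidence over the header value, or flag the file for review.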


Layer Structures in CAD Files

CAD drawings organise entities into layers, and layer names often indicate the entity's function. Common architectural layer-naming conventions include those of the AIA (American Institute of Architects) and BS 1192 (a British Standard). The following examples use AIA-style names:

  • A-WALL: walls
  • A-DOOR: doors
  • A-GLAZ: glazing
  • A-FURN: furniture
  • A-ANNO: annotations and text
  • A-DIMS: dimensions
  • A-FLOR: floor finishes
  • A-COLS: structural columns
  • A-STRS: stairs
  • A-CEIL: ceiling elements

A well-structured CAD file uses consistent layer names, making it straightforward for a parser to classify entities by layer. In practice, however, many CAD files are poorly structured: entities are placed on the wrong layers, custom layer names are used without documentation, or all entities are placed on a single "0" layer.

Robust parsing must handle these cases through a combination of:

  • Layer-name pattern matching: recognising common naming conventions and their variants (e.g., "WALL", "A-WALL", "A-Wall-Int", "Walls").
  • Geometric heuristics: classifying entities by their geometric properties (e.g., pairs of parallel lines with consistent spacing are likely walls, regardless of layer).
  • Machine-learning classifiers: trained models that predict entity type from geometric features, contextual position, and layer metadata.
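The first of these techniques, layer-name pattern matching, might look like the sketch below. The rule table is a small, hypothetical example and not an exhaustive or standard mapping; a real deployment would load such rules from per-project configuration. Keywords are matched only when they are not embedded in a longer word, so "DRYWALL" does not count as "WALL".

```python
import re
from typing import Optional


def keyword_pattern(*keywords: str) -> re.Pattern:
    """Match any keyword that is not surrounded by other letters."""
    return re.compile(r"(?<![A-Z])(?:" + "|".join(keywords) + r")(?![A-Z])")


# Hypothetical rule table: first matching rule wins.
LAYER_RULES = [
    ("wall", keyword_pattern("WALLS?")),
    ("door", keyword_pattern("DOORS?")),
    ("window", keyword_pattern("GLAZ(?:ING)?", "WINDOWS?")),
    ("furniture", keyword_pattern("FURN(?:ITURE)?")),
    ("dimension", keyword_pattern("DIMS?")),
    ("annotation", keyword_pattern("ANNO", "TEXT")),
]


def classify_layer(layer_name: str) -> Optional[str]:
    """Return a semantic category for a layer name, or None if no rule matches."""
    upper = layer_name.upper()
    for category, pattern in LAYER_RULES:
        if pattern.search(upper):
            return category
    return None


for name in ["WALL", "A-WALL", "A-Wall-Int", "Walls", "A-GLAZ", "0"]:
    print(name, "->", classify_layer(name))
```

Layers that return None (such as "0") are the ones that need the geometric heuristics or trained classifiers listed above.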

Modern spatial infrastructure software like Plotstuff addresses layer inconsistency by supporting configurable layer-mapping rules that can be adjusted per project or per source firm, combined with intelligent defaults that handle the most common conventions automatically.


Entity Extraction

Entity extraction is the core of the parsing process: converting raw graphical primitives into semantic building elements.

Wall Detection

Walls are typically represented as pairs of parallel lines (double-line walls), closed polylines (wall outlines), or hatched regions. The wall detection algorithm identifies wall candidates by:

  • Filtering entities on wall-related layers.
  • Detecting pairs of parallel lines separated by a plausible wall thickness (100-400 mm for interior partitions, 200-600 mm for exterior walls).
  • Identifying hatched regions where the hatch pattern indicates solid construction (solid fill, concrete hatch, or brick hatch).
  • Analysing line weight: thicker lines often represent walls in architectural drawings.

The output is a set of wall segments with start points, end points, thicknesses, and connectivity (which walls meet at corners). Corner resolution, determining how walls join at intersections, is a non-trivial geometric problem that requires snapping endpoints within a tolerance and resolving T-junctions, L-junctions, and cross-junctions.
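The parallel-line test at the heart of wall detection can be sketched as below. The thickness range reuses the figures given above; the angular tolerance and the minimum overlap ratio are assumptions picked for illustration. The sketch compares every pair of segments, which is quadratic and would need a spatial index for large drawings, and it does not attempt corner resolution.

```python
import math
from itertools import combinations

Point = tuple[float, float]
Segment = tuple[Point, Point]


def find_wall_pairs(
    segments: list[Segment],
    min_thickness: float = 100.0,   # drawing units assumed to be millimetres
    max_thickness: float = 600.0,
    angle_tol_deg: float = 1.0,     # assumed tolerance for "parallel"
    min_overlap_ratio: float = 0.5, # assumed share of the shorter segment that must overlap
) -> list[tuple[Segment, Segment, float]]:
    """Return (segment, segment, thickness) for parallel pairs at a plausible wall spacing."""
    sin_tol = math.sin(math.radians(angle_tol_deg))
    usable = [s for s in segments if s[0] != s[1]]  # drop zero-length segments
    pairs = []
    for a, b in combinations(usable, 2):
        (ax1, ay1), (ax2, ay2) = a
        (bx1, by1), (bx2, by2) = b
        a_len = math.hypot(ax2 - ax1, ay2 - ay1)
        b_len = math.hypot(bx2 - bx1, by2 - by1)
        ux, uy = (ax2 - ax1) / a_len, (ay2 - ay1) / a_len
        vx, vy = (bx2 - bx1) / b_len, (by2 - by1) / b_len
        if abs(ux * vy - uy * vx) > sin_tol:
            continue  # cross product of unit vectors = sine of the angle between them
        # Perpendicular distance from b's start point to the line through a.
        thickness = abs((bx1 - ax1) * uy - (by1 - ay1) * ux)
        if not (min_thickness <= thickness <= max_thickness):
            continue
        # Project b onto a's direction and measure how much of it overlaps a.
        t0 = (bx1 - ax1) * ux + (by1 - ay1) * uy
        t1 = (bx2 - ax1) * ux + (by2 - ay1) * uy
        overlap = min(a_len, max(t0, t1)) - max(0.0, min(t0, t1))
        if overlap >= min_overlap_ratio * min(a_len, b_len):
            pairs.append((a, b, thickness))
    return pairs


faces = [
    ((0, 0), (5000, 0)),        # one face of a wall
    ((0, 200), (5000, 200)),    # the opposite face, 200 units away
    ((0, 3000), (5000, 3000)),  # parallel, but too far away to be the same wall
    ((0, 0), (0, 3000)),        # perpendicular
]
print(find_wall_pairs(faces))
```

Only the first two segments are reported as a wall, with a thickness of 200 drawing units.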

Door and Window Detection

Doors appear in CAD files as blocks (named groups of entities) inserted at specific locations. The parser identifies door blocks by:

  • Their block names (which often include "DOOR", "DR", "D-", or similar prefixes).
  • Their geometric signatures (an arc representing the swing plus a line representing the leaf).
  • Their insertion points on wall segments (doors always occur in wall gaps).

Similarly, windows are identified by block names or by gap patterns in wall lines with parallel lines representing the glazing within the gap.

Room Detection

Rooms are not always explicitly drawn as closed polygons. The parser reconstructs room boundaries by:

  1. Building a planar graph from wall segments.
  2. Identifying enclosed regions (faces) in the graph using a face-finding algorithm that traverses edges counter-clockwise.
  3. Matching room labels (text entities) to the regions they fall within using point-in-polygon tests.

This process is straightforward when walls form clean, closed loops. When walls have gaps, overlapping segments, or missing corners, the parser must apply heuristic gap-closing algorithms, extending line segments to their nearest intersection within a tolerance, before room boundaries can be computed.
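Step 3, matching labels to regions, can be sketched with a standard ray-casting point-in-polygon test, together with the shoelace formula for room area. The sketch assumes the room polygons have already been recovered by the face-finding step (which is not shown) and that polygons are simple, without holes. Function names are illustrative.

```python
Point = tuple[float, float]


def polygon_area(polygon: list[Point]) -> float:
    """Area of a simple polygon via the shoelace formula."""
    edges = zip(polygon, polygon[1:] + polygon[:1])
    return abs(sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in edges)) / 2


def contains(polygon: list[Point], point: Point) -> bool:
    """Ray casting: count crossings of a horizontal ray running right from the point."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        if (y1 > y) != (y2 > y):  # the edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def label_rooms(polygons: list[list[Point]], labels: list[tuple[str, Point]]) -> list[dict]:
    """Attach to each polygon the first label whose insertion point lies inside it."""
    rooms = []
    for polygon in polygons:
        name = next((text for text, pos in labels if contains(polygon, pos)), None)
        rooms.append({"name": name, "area": polygon_area(polygon)})
    return rooms


polygons = [
    [(0, 0), (4, 0), (4, 3), (0, 3)],   # coordinates in metres
    [(4, 0), (6, 0), (6, 3), (4, 3)],
]
labels = [("Corridor", (5, 1)), ("Office 101", (2, 1.5))]
print(label_rooms(polygons, labels))
```

Rooms that end up with a name of None, or labels that fall inside no polygon, are useful signals that a wall loop failed to close.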

Text Extraction

Text entities carry room names, dimensions, notes, and annotations. The parser extracts each text string along with its position, rotation, font size, and layer. For vector PDFs, text extraction reads the content stream text operators (Tj and TJ to show text, Td and Tm to position it); for raster PDFs, OCR is required. Extracted text is then associated with spatial entities: a text string positioned inside a room boundary is likely the room name.


Scale Detection and Calibration

When a drawing lacks reliable unit metadata (common with PDFs and poorly configured DXF files), the parser must infer the scale.

Dimension-Line Analysis

If the drawing includes dimension annotations (lines with numerical labels indicating real-world measurements), the parser can compute the scale by comparing the graphical length of the dimension line to the numerical value in its label. For example, if a dimension line spans 150 drawing units and is labelled "3000 mm," the scale is 1 drawing unit = 20 mm.
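The calculation generalises to several dimension lines, where taking the median guards against a single mislabelled or misread dimension. The sketch below assumes dimension labels are simple strings such as "3000 mm", "4.5 m", or a bare number, and treats a bare number as millimetres; both the label format and that default are assumptions for illustration.

```python
import math
import re
from statistics import median
from typing import Optional

Point = tuple[float, float]

MM_PER_UNIT = {"mm": 1.0, "cm": 10.0, "m": 1000.0}
LABEL_PATTERN = re.compile(r"\s*(\d+(?:\.\d+)?)\s*(mm|cm|m)?\s*")


def label_to_mm(label: str) -> Optional[float]:
    """Convert a dimension label to millimetres; a bare number is assumed to be mm."""
    match = LABEL_PATTERN.fullmatch(label)
    if match is None:
        return None
    value, unit = match.groups()
    return float(value) * MM_PER_UNIT[unit or "mm"]


def scale_from_dimensions(dimensions: list[tuple[Point, Point, str]]) -> float:
    """Return millimetres per drawing unit, as the median over all usable dimensions."""
    ratios = []
    for start, end, label in dimensions:
        real_mm = label_to_mm(label)
        drawn = math.dist(start, end)
        if real_mm is not None and drawn > 0:
            ratios.append(real_mm / drawn)
    if not ratios:
        raise ValueError("no usable dimension annotations found")
    return median(ratios)


dims = [
    ((0, 0), (150, 0), "3000 mm"),    # the example from the text: 150 units = 3000 mm
    ((0, 0), (0, 60), "1200"),
    ((10, 10), (10, 235), "4.5 m"),
]
print(scale_from_dimensions(dims))  # mm per drawing unit
```

All three dimensions in the example agree on 20 mm per drawing unit. A wide spread between ratios would suggest multiple viewports at different scales on the same sheet.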

Known-Object Heuristics

Standard architectural elements have predictable sizes: single doors are approximately 900 mm wide, double doors 1800 mm, standard desks 1600 x 800 mm, and car-park spaces 2400 x 4800 mm. If the parser identifies a door block whose graphical width is 45 drawing units, the implied scale is 1 unit = 20 mm.

User-Assisted Calibration

When automated methods fail, the platform can prompt the user to identify a known dimension on the drawing. Plotstuff, for example, provides a calibration tool within its canvas-based editor where the user draws a line between two known points, enters the real-world distance, and the system calculates the scale factor automatically. This hybrid approach (automated where possible, user-assisted where necessary) keeps calibration reliable across the full range of file qualities encountered in practice.


OCR for Legacy Plans

Many building portfolios include legacy plans that exist only as scanned paper drawings or raster PDF files. Extracting spatial data from these documents requires optical character recognition (OCR) combined with computer-vision techniques.

Text Recognition

OCR engines (such as Tesseract, Google Vision API, or AWS Textract) identify text regions in the scanned image and convert them to machine-readable strings. Architectural drawings present specific OCR challenges:

  • Small text at reduced scan resolution.
  • Mixed orientations (room names may be rotated to follow wall directions).
  • Overlapping text and graphical elements.
  • Non-standard fonts and hand-lettering on older drawings.
  • Dimension text mixed with annotation text, requiring semantic classification after extraction.

Line and Shape Detection

Computer-vision algorithms detect geometric features in raster images:

  • Canny edge detection identifies boundaries between regions of different intensity.
  • Hough transform detects straight lines and circles from edge pixels.
  • Contour detection identifies closed shapes that may represent rooms.
  • Template matching locates known symbols (door swings, stair indicators, furniture blocks) by comparing image patches against a library of templates.
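To illustrate how the Hough transform turns edge pixels into a line, here is a deliberately small pure-Python version. Each pixel votes for every (θ, ρ) pair satisfying ρ = x·cos θ + y·sin θ, and the most-voted pair is the strongest line. Production systems would normally use an optimised image-processing library; the one-degree angular step and one-pixel ρ bins here are assumptions chosen for clarity.

```python
import math
from collections import Counter


def hough_strongest_line(
    edge_pixels: list[tuple[int, int]], theta_step_deg: int = 1
) -> tuple[int, int, int]:
    """Return (theta in degrees, rho in pixels, votes) for the most-voted line."""
    votes: Counter = Counter()
    for x, y in edge_pixels:
        for theta_deg in range(0, 180, theta_step_deg):
            theta = math.radians(theta_deg)
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            votes[(theta_deg, rho)] += 1
    (theta_deg, rho), count = votes.most_common(1)[0]
    return theta_deg, rho, count


# Edge pixels of a vertical line at x = 5, plus a few noise pixels
# of the kind a stain or fold mark might produce.
pixels = [(5, y) for y in range(50)] + [(20, 3), (31, 40), (12, 17)]
print(hough_strongest_line(pixels))
```

A normal angle of θ = 0 with ρ = 5 describes the vertical line x = 5. The three noise pixels cannot gather enough votes to compete, which is why the transform tolerates speckle but can still be fooled by long straight artefacts such as fold marks.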

Accuracy and Limitations

OCR-based parsing of legacy plans is significantly less accurate than vector parsing. Typical challenges include false positives (detecting lines from fold marks or stains), missed features (faint lines that fall below the detection threshold), and text misrecognition. Results require extensive manual review and correction. Nevertheless, OCR parsing provides a practical path for digitising legacy building portfolios that would otherwise remain inaccessible to modern spatial workflows.


Georeferencing

Georeferencing assigns real-world geographic coordinates (latitude, longitude, or a projected coordinate system) to the floorplan. This is necessary for applications that combine indoor and outdoor spatial data, such as campus navigation, emergency-response mapping, and IoT sensor integration.

Sources of Georeference Data

  • BIM files: IFC files may include an IfcSite entity with geographic coordinates.
  • Survey data: the architect's survey may specify the building's position relative to a national grid.
  • Manual alignment: the user places the floorplan on a map and adjusts its position and rotation.

Transformation

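Georeferencing applies a transformation to the floorplan's local coordinates, mapping them onto a geographic coordinate system. A similarity transformation (translation, rotation, and uniform scaling) is defined by at least two control points with known local and geographic coordinates; a general affine transformation, which also allows non-uniform scaling and shear, requires at least three non-collinear control points.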
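Two control points are enough to fix translation, rotation, and uniform scale, and that case can be written compactly with complex numbers: treating each point as z = x + iy, the mapping is z' = a·z + b. The sketch below assumes the geographic coordinates are in a projected system in metres (for example eastings and northings) and not raw latitude and longitude, and it does not handle mirrored drawings. The coordinates used are invented for illustration.

```python
import cmath
import math
from typing import Callable

Point = tuple[float, float]


def fit_two_point_transform(
    local_a: Point, geo_a: Point, local_b: Point, geo_b: Point
) -> tuple[Callable[[Point], Point], float, float]:
    """Fit z' = a*z + b from two control points.

    Returns (transform function, scale factor, rotation in degrees).
    """
    la, lb = complex(*local_a), complex(*local_b)
    ga, gb = complex(*geo_a), complex(*geo_b)
    if la == lb:
        raise ValueError("control points must be distinct")
    a = (gb - ga) / (lb - la)   # encodes rotation and uniform scale
    b = ga - a * la             # encodes translation

    def transform(point: Point) -> Point:
        z = a * complex(*point) + b
        return (z.real, z.imag)

    return transform, abs(a), math.degrees(cmath.phase(a))


# Local coordinates in millimetres; projected coordinates in metres (invented values).
to_geo, scale, rotation = fit_two_point_transform(
    local_a=(0, 0), geo_a=(500000.0, 180000.0),
    local_b=(1000, 0), geo_b=(500000.0, 180001.0),
)
print(scale, rotation)
print(to_geo((0, 2000)))
```

In this example one metre of local x-axis maps to one metre of northing, so the fitted scale is 0.001 (millimetres to metres) and the rotation is 90 degrees.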


Challenges in Floorplan Parsing

Messy CAD Files

Real-world CAD files are frequently messy: overlapping lines, entities on wrong layers, inconsistent scales, orphaned blocks, and undocumented custom layer names. A parser that works perfectly on clean test files may fail on production files. Robustness requires extensive error handling, heuristic fallbacks, and the ability to surface parsing issues for manual review.

Scanned and Raster PDFs

Scanned PDFs present the hardest parsing challenge. Lines must be detected from pixel data using edge-detection algorithms, text must be extracted via OCR (which is error-prone for architectural notation), and the entire process is slower and less accurate than vector parsing. Many real-world floorplan archives contain drawings from the 1970s-1990s that exist only as scans of paper originals and have never been converted to vector form.

Multi-Page and Multi-Scale Drawings

Architectural drawing sets often contain multiple views at different scales on the same page (e.g., a full floor plan at 1:100 alongside detail sections at 1:20). The parser must identify viewport boundaries and apply the correct scale to each region independently.

Revisions and Redlines

Drawings evolve through revisions. A DXF file may contain superseded geometry that has been moved to an "XREF" or "OLD" layer but not deleted. The parser must filter out revision artefacts to produce a clean current-state model.

Inconsistent Standards

Different architecture firms use different layer-naming conventions, block libraries, and annotation styles. A parser tuned to AIA layer standards will not work on files that follow BS 1192 conventions without adaptation. Configurable layer-mapping rules, adjustable per project or per source firm as in Plotstuff, are the usual remedy.


The Parsing Pipeline

A complete parsing pipeline progresses through several stages:

Stage 1: File Ingestion

Read the file, detect its format (DXF, DWG, IFC, PDF, SVG), and load it into an in-memory representation. For DWG files, convert to DXF or use an ODA-based reader. For PDFs, determine whether the content is vector or raster.

Stage 2: Entity Extraction

Extract all graphical primitives (lines, polylines, arcs, circles, text, blocks, hatches) and their attributes (layer, colour, line type, position).

Stage 3: Layer Classification

Map layers to semantic categories (wall, door, furniture, annotation, dimension) using layer-name matching, configurable mapping rules, and fallback heuristics.

Stage 4: Geometric Processing

Clean the geometry: snap near-miss endpoints, merge collinear segments, close small gaps in wall loops, remove duplicate entities, and normalise coordinates.
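Endpoint snapping, the first of these cleaning steps, can be sketched as follows. Each endpoint is replaced by the first previously seen endpoint within a tolerance; segments that collapse to a point or duplicate an earlier segment are then dropped. This greedy approach is an assumption made for brevity: it is order-dependent and quadratic, and a production pipeline would more likely use a spatial index or proper clustering.

```python
import math

Point = tuple[float, float]
Segment = tuple[Point, Point]


def snap_endpoints(segments: list[Segment], tolerance: float) -> list[Segment]:
    """Snap near-coincident endpoints together, then drop degenerate and duplicate segments."""
    representatives: list[Point] = []

    def snap(point: Point) -> Point:
        for rep in representatives:
            if math.dist(point, rep) <= tolerance:
                return rep
        representatives.append(point)
        return point

    cleaned: list[Segment] = []
    seen: set[frozenset] = set()
    for start, end in segments:
        a, b = snap(start), snap(end)
        if a == b:
            continue                # collapsed to a single point
        key = frozenset((a, b))     # direction-independent identity
        if key in seen:
            continue                # duplicate of an earlier segment
        seen.add(key)
        cleaned.append((a, b))
    return cleaned


raw = [
    ((0, 0), (1000, 0)),
    ((1000.4, 0.3), (1000, 800)),  # starts 0.5 units away from the previous end
    ((0, 0), (1000, 0.2)),         # near-duplicate of the first segment
    ((5, 5), (5.1, 5.1)),          # shorter than the tolerance
]
print(snap_endpoints(raw, tolerance=1.0))
```

After snapping, the two wall segments share the exact corner (1000, 0), which is what the later planar-graph and room-detection steps depend on.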

Stage 5: Semantic Interpretation

Apply wall-detection, door-detection, room-detection, and text-association algorithms to produce a semantic building model.

Stage 6: Scale and Calibration

Determine the drawing scale using dimension-line analysis, known-object heuristics, or user-assisted calibration.

Stage 7: Validation and Review

Present the parsed model to the user in a visual editor for review. Highlight potential issues (unclosed rooms, unclassified entities, ambiguous labels) and allow manual corrections.

Stage 8: Export

Serialise the validated model as structured data (JSON, GeoJSON) for consumption by downstream applications: generative-design engines, asset-management systems, wayfinding solvers, and digital-twin platforms.


Data Models for Spatial Entities

The output of a parsing pipeline is a structured data model. A well-designed model includes the following entity types:

Floor

The top-level container, representing a single storey of a building. Properties include floor number, name, elevation, and gross area.

Room (Space)

A bounded region of the floor with a name, function (office, corridor, toilet, stairwell), net area, and boundary polygon (a closed list of coordinate pairs). Rooms are the primary units of spatial analysis.

Wall

A line segment or polyline with start point, end point, thickness, material (if known), and references to the rooms on each side. Walls form the edges of the room graph.

Door

A point or short segment on a wall, with type (single, double, sliding, revolving), width, swing direction, and references to the rooms it connects.

Window

Similar to a door but without passage: position on wall, width, height (if available), and sill height.

Text Annotation

A string with position, rotation, font size, and semantic role (room name, dimension value, general note).

Block Instance

A named group of entities placed at a specific position with scale and rotation. Blocks represent furniture, fixtures, equipment, and symbols.

Dimension

A measurement annotation linking two points with a numerical value and unit. Dimensions are critical for scale calibration.

This data model, once populated, becomes the single source of truth for all spatial operations on that floor, from generative design to asset tracking, inspection workflows, and wayfinding analysis.
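As an illustration of how such a model might be expressed and serialised, here is a minimal sketch using Python dataclasses and JSON. The class and field names are hypothetical and cover only a subset of the entity types above; they are not a published schema.

```python
import json
from dataclasses import asdict, dataclass, field

Point = tuple[float, float]


@dataclass
class Room:
    id: str
    name: str
    function: str
    boundary: list[Point]          # closed polygon; first point not repeated


@dataclass
class Wall:
    id: str
    start: Point
    end: Point
    thickness_mm: float
    room_ids: list[str] = field(default_factory=list)   # rooms on either side


@dataclass
class Door:
    id: str
    wall_id: str
    position: Point
    width_mm: float
    connects: list[str] = field(default_factory=list)   # rooms joined by the door


@dataclass
class Floor:
    name: str
    elevation_m: float
    rooms: list[Room] = field(default_factory=list)
    walls: list[Wall] = field(default_factory=list)
    doors: list[Door] = field(default_factory=list)


floor = Floor(
    name="Level 1",
    elevation_m=0.0,
    rooms=[
        Room("r1", "Office 101", "office", [(0, 0), (4000, 0), (4000, 3000), (0, 3000)]),
        Room("r2", "Corridor", "corridor", [(4000, 0), (6000, 0), (6000, 3000), (4000, 3000)]),
    ],
    walls=[Wall("w1", (4000, 0), (4000, 3000), 150.0, ["r1", "r2"])],
    doors=[Door("d1", "w1", (4000, 1500), 900.0, ["r1", "r2"])],
)

payload = json.dumps(asdict(floor), indent=2)
print(payload)
```

Storing references by identifier (a door names its wall and the rooms it connects) keeps the serialised form flat while still allowing the room adjacency graph to be rebuilt on load.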


Key Takeaways

  • Floorplan parsing converts raw drawing files (DXF, DWG, IFC, PDF, SVG) into structured, machine-readable spatial data containing walls, doors, rooms, windows, and annotations.
  • The process involves file ingestion, entity extraction, layer classification, geometric cleaning, semantic interpretation, scale calibration, validation, and export.
  • Coordinate systems, unit detection, and scale calibration are pervasive challenges that require both automated heuristics and user-assisted tools.
  • Real-world CAD files are frequently messy, with entities on wrong layers, gaps in wall geometry, and inconsistent naming conventions. Robust parsers must handle these edge cases.
  • OCR and computer-vision techniques enable parsing of legacy scanned plans, though at lower accuracy than vector sources; scanned and raster PDFs remain the most difficult inputs.
  • The output is a structured data model containing floors, rooms, walls, doors, windows, text, blocks, and dimensions, forming the foundation for all downstream spatial intelligence workflows.

Frequently Asked Questions

Can a parser handle any CAD file automatically?

No parser achieves 100% accuracy on all files. Well-structured files following standard layer-naming conventions parse accurately with minimal intervention. Messy files require manual review and correction. The best platforms combine automated parsing with interactive review tools that let users fix issues visually.

What is the difference between parsing a DXF and parsing a PDF?

DXF files contain explicit geometric entities (lines, polylines, arcs) with layer assignments, making extraction relatively straightforward. PDF files contain rendering instructions that must be reverse-engineered to recover geometry. Vector PDFs yield precise results; raster (scanned) PDFs require image-processing techniques and produce lower-accuracy results. The key difference is that DXF preserves the original entity structure and layer organisation, while PDF flattens everything into rendering operations.

How do platforms handle BIM files differently from CAD files?

BIM files (IFC) already contain semantic information: walls, doors, and spaces are explicitly classified. Parsing IFC is therefore more about navigating the complex IFC schema and handling inconsistencies in exports than about inferring semantics from geometry. CAD files require the parser to infer semantics from layer names, geometric patterns, and heuristics.

What happens when the drawing scale is unknown?

The parser attempts to infer the scale from dimension annotations or known-object sizes. If automated detection fails, the user is prompted to calibrate manually by identifying a known measurement on the drawing. Without correct scale, all downstream measurements (room areas, travel distances, clearance checks) will be incorrect.

How does floorplan parsing relate to digital twins?

Floorplan parsing produces the static spatial model of a building: its base geometry and room data. This model becomes the foundation layer of a digital twin, onto which live data (sensor readings, occupancy counts, maintenance records) is overlaid. Without accurate parsing, the digital twin has no reliable spatial reference. See From Floorplan to Digital Twin for the full progression.


Next Steps

If you need to bring architectural drawings into a digital workflow, start by gathering your source files and identifying their formats. Prioritise vector sources (DXF, DWG, IFC) over raster PDFs where possible. Use a platform like Plotstuff that provides automated parsing, interactive calibration, and visual review tools within a browser-based environment. Once your floorplans are parsed into structured data, you can apply generative design, build digital twins, and edit layouts in a canvas-based editor without ever opening a desktop CAD application.

Related Articles

Spatial Technology

What Is Generative Design in Architecture and Facilities?

Generative design uses algorithms and constraints to automatically explore thousands of spatial layout options for architecture and facilities management. This guide covers the definition, history, parametric vs generative approaches, rule-based and AI-driven methods, and practical applications including space planning, furniture layout, and signage placement.

Spatial Technology

Canvas-Based Floorplan Editing: Why In-Browser Tools Are Replacing AutoCAD

Canvas-based floorplan editing uses HTML5 Canvas, WebGL, and SVG rendering within web browsers to let teams view, annotate, and modify building layouts without desktop CAD installations. This guide covers the rendering technologies, why browser-based tools matter for accessibility and collaboration, the Konva/Fabric.js/Three.js ecosystem, real-time collaboration, layer management, coordinate systems, performance optimisation, a comparison with desktop CAD, and limitations with practical workarounds.

Spatial Technology

From Floorplan to Digital Twin: Turning Drawings into Live Building Data

A digital twin transforms a static floorplan into a live, data-connected representation of a building by integrating sensor feeds, IoT devices, and operational systems. This guide covers the digital twin definition, maturity levels, the progression from static drawings to live data, sensor integration, IoT connectivity, asset tracking on floorplans, real-time occupancy, maintenance workflows, energy management, an implementation roadmap, and the ROI of digital twins.

Spatial Technology

How Generative Floorplan Design Works (Rules, Constraints, Outputs)

Generative floorplan design combines spatial rules, regulatory constraints, and optimisation algorithms to produce multiple viable layout options from a single set of inputs. This guide covers input requirements, constraint types, rule engines, optimisation methods, output formats, evaluation criteria, iteration workflows, and practical examples including sign placement, furniture layout, and safety equipment positioning.