Convert DocBook to TSV
DocBook vs TSV Format Comparison
| Aspect | DocBook (Source Format) | TSV (Target Format) |
|---|---|---|
| Format Overview |
DocBook
XML-Based Documentation Format
DocBook is an XML-based semantic markup language designed for technical documentation. Originally developed by HaL Computer Systems and O'Reilly Media in 1991, it is now maintained by OASIS. DocBook defines elements for books, articles, chapters, sections, tables, code listings, and more.
Technical Docs, XML-Based |
TSV
Tab-Separated Values
TSV is a plain text format for storing tabular data where columns are separated by tab characters and rows by newlines. TSV is simpler than CSV because tabs rarely appear in data, reducing the need for quoting and escaping. It is widely used for data exchange between databases, spreadsheets, and data analysis tools.
Tabular Data, Plain Text |
| Technical Specifications |
Structure: XML-based semantic markup
Encoding: UTF-8 XML
Standard: OASIS DocBook 5.1
Schema: RELAX NG, DTD, W3C XML Schema
Extensions: .xml, .dbk, .docbook |
Structure: Tab-delimited rows and columns
Encoding: UTF-8, ASCII
Delimiter: Tab character (\t, U+0009)
Standard: IANA text/tab-separated-values
Extensions: .tsv, .tab |
| Syntax Examples |
DocBook data table:
<table xmlns="http://docbook.org/ns/docbook">
  <title>Server Inventory</title>
  <tgroup cols="3">
    <thead>
      <row>
        <entry>Hostname</entry>
        <entry>IP</entry>
        <entry>Role</entry>
      </row>
    </thead>
    <tbody>
      <row>
        <entry>web-01</entry>
        <entry>10.0.1.10</entry>
        <entry>Web Server</entry>
      </row>
      <row>
        <entry>db-01</entry>
        <entry>10.0.1.20</entry>
        <entry>Database</entry>
      </row>
      <row>
        <entry>cache-01</entry>
        <entry>10.0.1.30</entry>
        <entry>Redis Cache</entry>
      </row>
    </tbody>
  </tgroup>
</table>
|
TSV output (tabs shown as arrows):
Hostname→IP→Role
web-01→10.0.1.10→Web Server
db-01→10.0.1.20→Database
cache-01→10.0.1.30→Redis Cache |
| Content Support |
|
|
| Advantages |
|
|
| Disadvantages |
|
|
| Common Uses |
|
|
| Best For |
|
|
| Version History |
Introduced: 1991 (HaL Computer Systems / O'Reilly)
Current Version: DocBook 5.1 (OASIS Standard)
Status: Mature, actively maintained
Evolution: SGML origins, migrated to XML |
Introduced: 1960s (tab-delimited data concept)
IANA Registration: 1993 (text/tab-separated-values)
Status: Stable, universally supported
Evolution: Unchanged since initial specification |
| Software Support |
Editors: Oxygen XML, XMLmind, Emacs
Processors: Saxon, xsltproc, Apache FOP
Validators: Jing, xmllint, Xerces
Other: Pandoc, DocBook XSL stylesheets |
Spreadsheets: Excel, Google Sheets, LibreOffice Calc
Languages: Python csv, pandas; R read.delim
Databases: MySQL LOAD DATA, PostgreSQL COPY
Other: Any text editor, Unix tools (cut, awk) |
Why Convert DocBook to TSV?
Converting DocBook to TSV extracts tabular data from structured technical documentation into a simple, flat format that can be immediately opened in spreadsheet applications or imported into databases. DocBook documents frequently contain data tables with inventories, specifications, test results, and reference data that are more useful in a spreadsheet-compatible format for analysis and manipulation.
TSV (Tab-Separated Values) offers advantages over CSV for data exchange because tab characters rarely appear in natural text data, eliminating most quoting and escaping issues. When you copy data from a spreadsheet and paste it into a text editor, the result is naturally TSV. This simplicity makes TSV a good fit for data whose cell values contain commas, such as addresses or descriptive text.
The conversion process identifies all <table> and <informaltable> elements in the DocBook source and extracts their content into TSV format. Header rows from <thead> become the first line of TSV output. Data rows from <tbody> follow with tab-separated values. Multiple tables in a single document can be extracted as separate TSV files or concatenated with separator lines.
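The extraction steps described above can be sketched in Python with the standard library. This is a minimal illustration, not the converter's actual implementation: the function names are hypothetical, it assumes DocBook 5 namespaced input with CALS-style tables (`tgroup`/`row`/`entry`), and it ignores cell spans and nested tables.

```python
import xml.etree.ElementTree as ET

NS = "{http://docbook.org/ns/docbook}"  # DocBook 5 namespace, as in the examples here


def cell_text(entry):
    """Flatten an <entry> to one line; collapsing whitespace removes tabs and newlines."""
    return " ".join("".join(entry.itertext()).split())


def table_to_tsv(table):
    """Render one CALS-style table: <thead> rows first, then <tbody> rows."""
    lines = []
    for part in ("thead", "tbody"):
        for row in table.iterfind(f"{NS}tgroup/{NS}{part}/{NS}row"):
            lines.append("\t".join(cell_text(e) for e in row.iterfind(f"{NS}entry")))
    return "\n".join(lines) + "\n"


def docbook_tables_to_tsv(xml_text):
    """Return one TSV string per <table> or <informaltable>, in document order."""
    root = ET.fromstring(xml_text)
    wanted = {f"{NS}table", f"{NS}informaltable"}
    return [table_to_tsv(el) for el in root.iter() if el.tag in wanted]


# Illustrative input only; not taken from a real document.
SAMPLE = """<article xmlns="http://docbook.org/ns/docbook">
  <informaltable><tgroup cols="2">
    <thead><row><entry>Host</entry><entry>RAM</entry></row></thead>
    <tbody><row><entry>web-01</entry><entry>16 GB</entry></row></tbody>
  </tgroup></informaltable>
</article>"""

if __name__ == "__main__":
    for tsv in docbook_tables_to_tsv(SAMPLE):
        print(tsv, end="")
```

Returning one string per table leaves the caller free to write separate files or to join the strings with a separator of its choice.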
This conversion is especially useful for data analysts, researchers, and engineers who need to work with data documented in DocBook format. Instead of manually copying table data, the conversion automatically extracts all tabular content into a format that Excel, Google Sheets, pandas, R, and database import tools can process directly.
Key Benefits of Converting DocBook to TSV:
- Spreadsheet Ready: Opens directly in Excel, Google Sheets, LibreOffice
- Minimal Escaping: Tab delimiters avoid comma-in-data quoting problems
- Data Analysis: Import directly into pandas, R, or database tools
- Table Extraction: Pull structured data from complex documentation
- Database Import: Use LOAD DATA or COPY commands for direct import
- Clipboard Friendly: Paste TSV data directly into spreadsheets
- Universal: Nearly every data tool supports the tab-delimited format
Practical Examples
Example 1: Server Inventory Table
Input DocBook file (inventory.xml):
<table xmlns="http://docbook.org/ns/docbook">
  <title>Production Servers</title>
  <tgroup cols="4">
    <thead>
      <row>
        <entry>Host</entry><entry>IP</entry>
        <entry>OS</entry><entry>RAM</entry>
      </row>
    </thead>
    <tbody>
      <row>
        <entry>web-01</entry><entry>10.0.1.10</entry>
        <entry>Ubuntu 22.04</entry><entry>16 GB</entry>
      </row>
      <row>
        <entry>db-01</entry><entry>10.0.1.20</entry>
        <entry>RHEL 9</entry><entry>64 GB</entry>
      </row>
    </tbody>
  </tgroup>
</table>
Output TSV file (inventory.tsv), tabs shown as arrows:
Host→IP→OS→RAM
web-01→10.0.1.10→Ubuntu 22.04→16 GB
db-01→10.0.1.20→RHEL 9→64 GB
Example 2: Test Results Extraction
Input DocBook file (test-results.dbk):
<table xmlns="http://docbook.org/ns/docbook">
  <title>Performance Benchmarks</title>
  <tgroup cols="3">
    <thead>
      <row>
        <entry>Test</entry>
        <entry>Duration (ms)</entry>
        <entry>Status</entry>
      </row>
    </thead>
    <tbody>
      <row>
        <entry>API Response</entry>
        <entry>45</entry>
        <entry>PASS</entry>
      </row>
      <row>
        <entry>DB Query</entry>
        <entry>120</entry>
        <entry>PASS</entry>
      </row>
      <row>
        <entry>File Upload</entry>
        <entry>890</entry>
        <entry>WARN</entry>
      </row>
    </tbody>
  </tgroup>
</table>
Output TSV file (test-results.tsv), tabs shown as arrows:
Test→Duration (ms)→Status
API Response→45→PASS
DB Query→120→PASS
File Upload→890→WARN
Example 3: Configuration Reference
Input DocBook file (config-ref.xml):
<table xmlns="http://docbook.org/ns/docbook">
  <title>Environment Variables</title>
  <tgroup cols="3">
    <thead>
      <row>
        <entry>Variable</entry>
        <entry>Default</entry>
        <entry>Description</entry>
      </row>
    </thead>
    <tbody>
      <row>
        <entry>APP_PORT</entry>
        <entry>3000</entry>
        <entry>HTTP server port</entry>
      </row>
      <row>
        <entry>DB_URL</entry>
        <entry>localhost:5432</entry>
        <entry>Database connection string</entry>
      </row>
    </tbody>
  </tgroup>
</table>
Output TSV file (config-ref.tsv), tabs shown as arrows:
Variable→Default→Description
APP_PORT→3000→HTTP server port
DB_URL→localhost:5432→Database connection string
Frequently Asked Questions (FAQ)
Q: What is TSV format?
A: TSV (Tab-Separated Values) is a plain text format for storing tabular data where columns are separated by tab characters (U+0009) and rows by newlines. TSV is registered with IANA as text/tab-separated-values. It is simpler than CSV because tab characters rarely appear in data values, reducing quoting and escaping complexity.
Q: How does the converter extract tables from DocBook?
A: The converter identifies all <table> and <informaltable> elements in the DocBook document. It extracts header rows from <thead> and data rows from <tbody>. Each <entry> element becomes a tab-separated field. If a document contains multiple tables, they can be output as separate TSV files or combined with blank-line separators.
Q: What is the difference between TSV and CSV?
A: TSV uses tab characters as delimiters while CSV uses commas. TSV's main advantage is that tab characters rarely appear in natural data, so values rarely need quoting or escaping. CSV requires quoting values that contain commas, double quotes, or newlines. TSV is often preferred for data containing addresses, descriptions, or other text with commas.
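To make the quoting difference concrete, the sketch below writes the same row with Python's standard `csv` module using a comma and then a tab as the delimiter. The row values and the `render` helper are invented for illustration.

```python
import csv
import io

# Illustrative row: two of the three values contain commas.
ROW = ["db-01", "12 Main St, Springfield", "Primary replica, east"]


def render(fields, delimiter):
    """Serialize one row with the csv module's default (minimal) quoting."""
    buf = io.StringIO()
    csv.writer(buf, delimiter=delimiter, lineterminator="\n").writerow(fields)
    return buf.getvalue()


csv_line = render(ROW, ",")   # fields with commas are wrapped in double quotes
tsv_line = render(ROW, "\t")  # no field contains a tab, so nothing is quoted

if __name__ == "__main__":
    print(csv_line, end="")
    print(tsv_line, end="")
```

A value that itself contains a tab or a newline would still need special handling in TSV, which is why converters commonly replace such characters with spaces before writing.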
Q: Can I open TSV files in Excel?
A: Yes, Excel, Google Sheets, and LibreOffice Calc all open TSV files natively. In Excel, you can open a .tsv file directly, and Excel will automatically detect the tab delimiter. You can also use Data > From Text and specify tab as the delimiter. Google Sheets handles TSV files through the import function with automatic delimiter detection.
Q: What happens to non-table content in the DocBook file?
A: Since TSV is a pure tabular data format, non-table content (paragraphs, lists, code blocks, headings) is not included in the TSV output by default. The converter focuses on extracting tabular data. Section titles may optionally be included as comment lines (prefixed with #) to provide context for the data tables that follow.
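Comment lines are not part of the TSV format itself, so a consumer has to filter them out before parsing. A minimal sketch of one way to do that in Python follows; the `split_comments` helper is hypothetical, and it assumes that no data row starts with `#`.

```python
import csv


def split_comments(tsv_text):
    """Separate '#'-prefixed context lines from the tab-separated rows."""
    titles, data_lines = [], []
    for line in tsv_text.splitlines():
        if line.startswith("#"):
            titles.append(line[1:].strip())
        elif line:  # ignore blank separator lines
            data_lines.append(line)
    rows = list(csv.reader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE))
    return titles, rows


# Illustrative input: a section title as a comment, then a header and one data row.
SAMPLE = "# Environment Variables\nVariable\tDefault\nAPP_PORT\t3000\n"

if __name__ == "__main__":
    titles, rows = split_comments(SAMPLE)
    print(titles)
    print(rows)
```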
Q: How are merged cells handled?
A: DocBook supports cell spanning through the morerows and namest/nameend attributes. Since TSV is a flat format that does not support merged cells, spanning cells are expanded. A cell spanning two columns is repeated in both positions. A cell spanning two rows appears in both rows. This ensures the TSV has consistent column counts across all rows.
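The expansion rule can be sketched as follows. This is an illustration under stated assumptions rather than the converter's own code: the `expand_spans` function is hypothetical, the table is assumed to be well-formed, and `<colspec>` elements are assumed to be listed in column order so that their position gives the column index.

```python
import xml.etree.ElementTree as ET

NS = "{http://docbook.org/ns/docbook}"  # DocBook 5 namespace


def expand_spans(tgroup, part="tbody"):
    """Return the rows of one table part with every spanned cell repeated."""
    ncols = int(tgroup.get("cols"))
    # Assumption: <colspec> elements appear in column order.
    col_index = {
        spec.get("colname"): i
        for i, spec in enumerate(tgroup.iterfind(f"{NS}colspec"))
    }
    carry = {}  # column index -> (text, number of rows still to fill)
    rows = []
    for row in tgroup.iterfind(f"{NS}{part}/{NS}row"):
        cells = [None] * ncols
        # Place cells held over from earlier rows via morerows.
        for col, (text, left) in list(carry.items()):
            cells[col] = text
            if left > 1:
                carry[col] = (text, left - 1)
            else:
                del carry[col]
        col = 0
        for entry in row.iterfind(f"{NS}entry"):
            text = " ".join("".join(entry.itertext()).split())
            if entry.get("namest"):
                start = col_index[entry.get("namest")]
                end = col_index[entry.get("nameend", entry.get("namest"))]
            else:
                while cells[col] is not None:  # skip columns filled by a row span
                    col += 1
                start = end = col
            more = int(entry.get("morerows", "0"))
            for c in range(start, end + 1):
                cells[c] = text
                if more:
                    carry[c] = (text, more)
            col = end + 1
        rows.append([c if c is not None else "" for c in cells])
    return rows


# Illustrative table: one cell spans two rows, another spans two columns.
SAMPLE = """<table xmlns="http://docbook.org/ns/docbook">
  <tgroup cols="3">
    <colspec colname="c1"/><colspec colname="c2"/><colspec colname="c3"/>
    <tbody>
      <row><entry morerows="1">web</entry><entry namest="c2" nameend="c3">shared</entry></row>
      <row><entry>10.0.1.11</entry><entry>standby</entry></row>
    </tbody>
  </tgroup>
</table>"""

if __name__ == "__main__":
    tgroup = ET.fromstring(SAMPLE).find(f"{NS}tgroup")
    for fields in expand_spans(tgroup):
        print("\t".join(fields))
```

Every returned row has exactly `cols` fields, which is what keeps the column count consistent in the TSV output.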
Q: Can I import the TSV output into a database?
A: Yes, most databases support TSV import. MySQL uses LOAD DATA INFILE with FIELDS TERMINATED BY '\t'. PostgreSQL uses COPY with DELIMITER E'\t'. SQLite uses .import with .mode tabs. Python's pandas library reads TSV with pd.read_csv('file.tsv', sep='\t'). The clean tabular structure makes database import straightforward.
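As one possible illustration of the import step, the sketch below loads TSV text into an in-memory SQLite database using Python's standard `sqlite3` and `csv` modules, rather than the `.import` shell command mentioned above. The `load_tsv` helper and the `servers` table name are hypothetical, and the header values are assumed to be safe to use as column names.

```python
import csv
import io
import sqlite3

# Same data as Example 1 on this page, with real tab characters.
TSV_TEXT = (
    "Host\tIP\tOS\tRAM\n"
    "web-01\t10.0.1.10\tUbuntu 22.04\t16 GB\n"
    "db-01\t10.0.1.20\tRHEL 9\t64 GB\n"
)


def load_tsv(conn, table, tsv_file):
    """Create `table` from the TSV header row and insert the remaining rows."""
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    # Assumption: header names contain no double quotes.
    columns = ", ".join(f'"{name}"' for name in header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_tsv(conn, "servers", io.StringIO(TSV_TEXT))
    for record in conn.execute('SELECT "Host", "RAM" FROM servers'):
        print(record)
```

All columns are created without a declared type, so values are stored as text; a real import would usually declare types or cast after loading.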
Q: Can I convert TSV back to DocBook?
A: Yes, our converter supports TSV to DocBook conversion. The reverse process reads the TSV data, treats the first row as table headers, and generates a DocBook <table> with proper <tgroup>, <thead>, and <tbody> structure. This is useful for incorporating spreadsheet data into DocBook documentation projects.
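The reverse direction can be sketched in a few lines of Python. This is an assumed, simplified illustration (the `tsv_to_docbook` function is hypothetical): it treats the first TSV row as the header, expects at least one data row, and emits a DocBook 5 table without `colspec` elements or other attributes.

```python
import csv
import io
import xml.etree.ElementTree as ET

DOCBOOK_NS = "http://docbook.org/ns/docbook"


def qname(tag):
    """Build a namespaced tag name in ElementTree's {uri}tag form."""
    return f"{{{DOCBOOK_NS}}}{tag}"


def tsv_to_docbook(tsv_text, title):
    """Build a DocBook <table> string from TSV text; the first row is the header."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE))
    header, body = rows[0], rows[1:]
    ET.register_namespace("", DOCBOOK_NS)  # serialize with a default xmlns
    table = ET.Element(qname("table"))
    ET.SubElement(table, qname("title")).text = title
    tgroup = ET.SubElement(table, qname("tgroup"), cols=str(len(header)))
    for part, part_rows in (("thead", [header]), ("tbody", body)):
        section = ET.SubElement(tgroup, qname(part))
        for fields in part_rows:
            row = ET.SubElement(section, qname("row"))
            for value in fields:
                ET.SubElement(row, qname("entry")).text = value
    return ET.tostring(table, encoding="unicode")


if __name__ == "__main__":
    print(tsv_to_docbook("Test\tStatus\nDB Query\tPASS\n", "Benchmarks"))
```

Building the tree with ElementTree rather than string concatenation means characters such as `<` and `&` in cell values are escaped automatically.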