Word文書にある下付き文字、上付き文字、方程式をプログラムで反復する方法

Question

私はいくつかのWord文書を持っており、それぞれに次のような数百ページの科学データが含まれています。

化学式（すべての適切な下付き文字と上付き文字を含むH2SO4）
科学的数値（上付き文字を使用してフォーマットされた指数）
たくさんの数式。 Wordの数式エディタを使用して書かれています。

問題は、このデータをWordに保存することは私たちにとって効率的ではないということです。したがって、このすべての情報をデータベース（MySQL）に保存する必要があります。フォーマットをLaTexに変換したいと思います。

VBAを使用してWord文書内のすべての下付き文字、上付き文字、数式を反復処理する方法はありますか？

mnmnc · Accepted Answer

はいあります。 PowershellはWordファイルを非常にうまく処理するので、私はPowershellを使用することをお勧めします。私が一番簡単な方法だと思います。

PowershellとWordの自動化の詳細については、こちらをご覧ください： http://www.simple-talk.com/dotnet/.net-tools/com-automation-of-office-applications-via-powershell/

私はもう少し深く掘り下げて、このPowerShellスクリプトを見つけました：

param([string]$docpath,[string]$htmlpath = $docpath) $srcfiles = Get-ChildItem $docPath -filter "*.doc" $saveFormat = [Enum]::Parse([Microsoft.Office.Interop.Word.WdSaveFormat], "wdFormatFilteredHTML"); $Word = new-object -comobject Word.application $Word.Visible = $False function saveas-filteredhtml { $opendoc = $Word.documents.open($doc.FullName); $opendoc.saveas([ref]"$htmlpath\$doc.fullname.html", [ref]$saveFormat); $opendoc.close(); } ForEach ($doc in $srcfiles) { Write-Host "Processing :" $doc.FullName saveas-filteredhtml $doc = $null } $Word.quit();

.ps1として保存し、次のコマンドで開始します。

convertdoc-tohtml.ps1 -docpath "C:\Documents" -htmlpath "C:\Output"

指定したディレクトリのすべての.docファイルをhtmlファイルとして保存します。だから私は下付き文字付きのH2SO4を持っているドキュメントファイルを持っています、そしてPowerShell変換後の出力は次のとおりです：

<html> <head> <meta http-equiv=Content-Type content="text/html; charset=windows-1252"> <meta name=Generator content="Microsoft Word 14 (filtered)"> <style> <!-- /* Font Definitions */ @font-face {font-family:Calibri; panose-1:2 15 5 2 2 2 4 3 2 4;} /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal {margin-top:0in; margin-right:0in; margin-bottom:10.0pt; margin-left:0in; line-height:115%; font-size:11.0pt; font-family:"Calibri","sans-serif";} .MsoChpDefault {font-family:"Calibri","sans-serif";} .MsoPapDefault {margin-bottom:10.0pt; line-height:115%;} @page WordSection1 {size:8.5in 11.0in; margin:1.0in 1.0in 1.0in 1.0in;} div.WordSection1 {page:WordSection1;} --> </style> </head> <body lang=EN-US> <div class=WordSection1> <p class=MsoNormal><span lang=PL>H<sub>2</sub>SO<sub>4</sub></span></p> </div> </body> </html>

ご覧のとおり、添え字にはHTMLで独自のタグがあるため、残っているのはファイルをbashまたはc ++で解析して、本文から/ bodyに切り取り、をLATEXに変更して、後で残りのHTMLタグを削除することだけです。

http://blogs.technet.com/b/bshukla/archive/2011/09/27/3347395.aspx からのコード

そこで、C++でパーサーを開発して、HTMLの添え字を探し、それをLATEXの添え字に置き換えました。

コード：

#include <iostream> #include <fstream> #include <string> #include <sstream> #include <vector> using namespace std; vector < vector <string> > parse( vector < vector <string> > vec, string filename ) { /* PARSES SPECIFIED FILE. EACH Word SEPARATED AND PLACED IN VECTOR FIELD. REQUIRED INCLUDES: #include <iostream> #include <fstream> #include <string> #include <sstream> #include <vector> EXPECTS: TWO DIMENTIONAL VECTOR STRING WITH FILENAME RETURNS: TWO DIMENTIONAL VECTOR vec[lines][words] */ string vword; ifstream vfile; string tmp; // FILENAME CONVERSION FROM STING // TO CHAR TABLE char cfilename[filename.length()+1]; if( filename.length() < 126 ) { for(int i = 0; i < filename.length(); i++) cfilename[i] = filename[i]; cfilename[filename.length()] = '\0'; } else return vec; // OPENING FILE // vfile.open( cfilename ); if (vfile.is_open()) { while ( vfile.good() ) { getline( vfile, vword ); vector < string > vline; vline.clear(); for (int i = 0; i < vword.length(); i++) { tmp = ""; // PARSING CONTENT. OMITTING SPACES AND TABS // while (vword[i] != ' ' && vword[i] != ((char)9) && i < vword.length() ) tmp += vword[i++]; if( tmp.length() > 0 ) vline.Push_back(tmp); } if (!vline.empty()) vec.Push_back(vline); } vfile.close(); } else cout << "Unable to open file " << filename << ".
"; return vec; } int main() { vector < vector < string > > vec; vec = parse( vec, "parse.html" ); bool body = false; for (int i = 0; i < vec.size(); i++) { for (int j = 0; j < vec[i].size(); j++) { if ( vec[i][j] == "<body") body=true; if ( vec[i][j] == "</body>" ) body=false; if ( body == true ) { for ( int k=0; k < vec[i][j].size(); k++ ) { if (k+4 < vec[i][j].size() ) { if ( vec[i][j][k] == '<' && vec[i][j][k+1] == 's' && vec[i][j][k+2] == 'u' && vec[i][j][k+3] == 'b' && vec[i][j][k+4] == '>' ) { string tmp = ""; while (vec[i][j][k+5] != '<') { tmp+=vec[i][j][k+5]; k++; } tmp = "_{" + tmp + "}"; k=k+5+5; cout << tmp << endl;; } else cout << vec[i][j][k]; } else cout << vec[i][j][k]; } cout << endl; } } } return 0; }

Htmlファイルの場合：

<html> <head> <meta http-equiv=Content-Type content="text/html; charset=windows-1252"> <meta name=Generator content="Microsoft Word 14 (filtered)"> <style> <!-- /* Font Definitions */ @font-face {font-family:Calibri; panose-1:2 15 5 2 2 2 4 3 2 4;} /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal {margin-top:0in; margin-right:0in; margin-bottom:10.0pt; margin-left:0in; line-height:115%; font-size:11.0pt; font-family:"Calibri","sans-serif";} .MsoChpDefault {font-family:"Calibri","sans-serif";} .MsoPapDefault {margin-bottom:10.0pt; line-height:115%;} @page WordSection1 {size:8.5in 11.0in; margin:1.0in 1.0in 1.0in 1.0in;} div.WordSection1 {page:WordSection1;} --> </style> </head> <body lang=EN-US> <div class=WordSection1> <p class=MsoNormal><span lang=PL>H<sub>2</sub>SO<sub>4</sub></span></p> </div> </body> </html>

出力は次のとおりです。

<body lang=EN-US> <div class=WordSection1> <p class=MsoNormal><span lang=PL>H_{2} SO_{4} </span></p> </div>

もちろん理想的ではありませんが、扱いは概念実証です。

James Mertz · Answer

2007以降のOfficeドキュメントから直接xmlを抽出できます。これは次の方法で行われます。

ファイルの名前を.docxから.Zipに変更します
7Zip（または他の抽出プログラム）を使用してファイルを抽出します
ドキュメントの実際の内容については、Wordサブフォルダーとdocument.xmlファイルの下にある抽出フォルダーを確認してください。これには、ドキュメントのすべてのコンテンツが含まれている必要があります。

enter image description here

サンプルドキュメントを作成しましたが、bodyタグでこれを見つけました（これをすばやくまとめたので、フォーマットが少しずれている可能性があります）。

<?xml version="1.0" encoding="UTF-8" standalone="true"?> <w:body> -<w:p w:rsidRDefault="000E0C3A" w:rsidR="008B5DAA"> -<w:r> <w:t xml:space="preserve">This </w:t> </w:r> - <w:r w:rsidRPr="000E0C3A"> -<w:rPr> <w:vertAlign w:val="superscript"/> </w:rPr> <w:t>is</w:t> </w:r> - <w:r> <w:t xml:space="preserve"> a </w:t> </w:r> -<w:r w:rsidRPr="000E0C3A"> -<w:rPr> <w:vertAlign w:val="subscript"/> </w:rPr> <w:t>test</w:t> </w:r> -<w:r> <w:t>.</w:t> </w:r> </w:p> </w:body>

<w:t>タグはテキスト用であり、<w:rPr>はフォントの定義であり、<w:p>は新しい段落であるようです。

同等の単語は次のようになります。

enter image description here

Tony Dallimore · Answer

私は、mnmncが追求しているアプローチとは異なるアプローチを検討してきました。

テスト用のWord文書をHTMLとして保存しようとしても成功しませんでした。私は過去に、Officeで生成されたHTMLがもみ殻でいっぱいであるため、必要なビットを選択することはほぼ不可能であることを発見しました。私はそれがここに当てはまることに気づきました。方程式にも問題がありました。 Wordは方程式を画像として保存します。各方程式には、WMZの拡張子を持つ画像とGIFの拡張子を持つ画像が2つあります。 Google Chromeでhtmlファイルを表示する場合、方程式は問題ないように見えますが、すばらしいものではありません。透明な画像を処理できる画像表示/編集ツールで表示すると、外観はGIFファイルと一致します。 Internet ExplorerでHTMLファイルを表示すると、方程式は完璧に見えます。 HTMLはWMZファイルを参照しているので、Internet ExplorerにはWMZファイルを表示する拡張機能が含まれていると思います。WMZファイルは、WMPが破損していると主張していますが、明らかにWindows MediaPlayerのスキンです。

追加情報

この情報を元の回答に含めるべきでした。

私はHtmlとして保存した小さなWord文書を作成しました。下の画像の3つのパネルは、元のWordドキュメント、Microsoft Internet Explorerによって表示されたHtmlドキュメント、およびGoogleChromeによって表示されたHtmlドキュメントを示しています。

Original Word, Html displayed by IE and HTML displayed by Chrome

前に説明したように、IEとChromeの画像の違いは、方程式がWMZ形式とGIF形式の2回保存された結果です。 HTMLが大きすぎて、ここに表示できません。

マクロによって作成されるHTMLは次のとおりです。

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> <head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> </head><body> <p>Some ordinary text.</p> <p>H<sub>2</sub>SO<sub>4</sub>.</p> <p>Abc &amp; def &gt; ghi &lt; jkl</p> <p>x<sup>3</sup>+ x<sup>2</sup>+3x+4=0.</p><p></p> <p><i>Equation</i> </p> <p>Mno</p> <p><i>Equation</i></p> </body></html>

次のように表示されます：

Html created by macro as displayed by IE

無料の MathTypeソフトウェア開発キットには明らかにLaTexに変換するルーチンが含まれているため、方程式の変換は試みていません。

コードはかなり基本的なものなので、コメントはあまりありません。不明な点がないか尋ねます。注：これは元のコードの改良版です。

Sub ConvertToHtml() Dim FileNum As Long Dim NumPendingCR As Long Dim objChr As Object Dim PathCrnt As String Dim rng As Word.Range Dim WithinPara As Boolean Dim WithinSuper As Boolean Dim WithinSub As Boolean FileNum = FreeFile PathCrnt = ActiveDocument.Path Open PathCrnt & "\TestWord.html" For Output Access Write Lock Write As #FileNum Print #FileNum, "<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 Frameset//EN""" & _ " ""http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd"">" & _ vbCr & vbLf & "<html xmlns=""http://www.w3.org/1999/xhtml"" " & _ "xml:lang=""en"" lang=""en"">" & vbCr & vbLf & _ "<head><meta http-equiv=""Content-Type"" content=""text/html; " _ & "charset=utf-8"" />" & vbCr & vbLf & "</head><body>" For Each rng In ActiveDocument.StoryRanges NumPendingCR = 0 WithinPara = False WithinSub = False WithinSuper = False Do While Not (rng Is Nothing) For Each objChr In rng.Characters If objChr.Font.Superscript Then If Not WithinSuper Then ' Start of superscript Print #FileNum, "<sup>"; WithinSuper = True End If ElseIf WithinSuper Then ' End of superscript Print #FileNum, "</sup>"; WithinSuper = False End If If objChr.Font.Subscript Then If Not WithinSub Then ' Start of subscript Print #FileNum, "<sub>"; WithinSub = True End If ElseIf WithinSub Then ' End of subscript Print #FileNum, "</sub>"; WithinSub = False End If Select Case objChr Case vbCr NumPendingCR = NumPendingCR + 1 Case "&" Print #FileNum, CheckPara(NumPendingCR, WithinPara) & "&amp;"; Case "<" Print #FileNum, CheckPara(NumPendingCR, WithinPara) & "&lt;"; Case ">" Print #FileNum, CheckPara(NumPendingCR, WithinPara) & "&gt;"; Case Chr(1) Print #FileNum, CheckPara(NumPendingCR, WithinPara) & "<i>Equation</i>"; Case Else Print #FileNum, CheckPara(NumPendingCR, WithinPara) & objChr; End Select Next Set rng = rng.NextStoryRange Loop Next If WithinPara Then Print #FileNum, "</p>"; withpara = False End If Print #FileNum, vbCr & vbLf & "</body></html>" Close FileNum End Sub Function CheckPara(ByRef NumPendingCR As Long, _ ByRef WithinPara As Boolean) As String ' Have a character to output. Check paragraph status, return ' necessary commands and adjust NumPendingCR and WithinPara. Dim RtnValue As String RtnValue = "" If NumPendingCR = 0 Then If Not WithinPara Then CheckPara = "<p>" WithinPara = True Else CheckPara = "" End If Exit Function End If If WithinPara And (NumPendingCR > 0) Then ' Terminate paragraph RtnValue = "</p>" NumPendingCR = NumPendingCR - 1 WithinPara = False End If Do While NumPendingCR > 1 ' Replace each pair of CRs with an empty paragraph RtnValue = RtnValue & "<p></p>" NumPendingCR = NumPendingCR - 2 Loop RtnValue = RtnValue & vbCr & vbLf & "<p>" WithinPara = True NumPendingCR = 0 CheckPara = RtnValue End Function

soandos · Answer

これを行う最も簡単な方法は、VBAの次の行です。

Sub testing() With ActiveDocument.Content.Find .ClearFormatting .Format = True .Font.Superscript = True .Execute Forward:=True End With End Sub

これにより、上付きのテキストがすべて検索されます。それを使って何かをしたい場合は、メソッドに挿入するだけです。たとえば、上付き文字で「super」という単語を見つけて「superfound」に変換するには、次のように使用します。

Sub testing() With ActiveDocument.Content.Find .ClearFormatting .Format = True .Font.Superscript = True .Execute Forward:=True, Replace:=wdReplaceAll, _ FindText:="super", ReplaceWith:="super found" End With End Sub