C#: What is the most efficient way to clone Office Open XML documents?

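When working with Office Open XML documents (for example, documents created by Word, Excel, or PowerPoint since the release of Office 2007), you will often want to clone or copy an existing document and then make changes to that clone, thereby creating a new document.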

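Several questions have already been asked and answered in this context (sometimes incorrectly, or at least not optimally), which shows that users really do run into problems here. For example: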

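  • Copying a Word document using OpenXml and C#
  • Word OpenXml: Word found unreadable content
  • Open XML SDK: opening a Word template and saving it under a different file name
  • docx document gets corrupted when copied through OpenXML in C#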

So, the questions are:

  • What are the possible ways to correctly clone or copy these documents?
  • Which approach is the most efficient?

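  • The following sample class shows multiple ways to correctly copy pretty much any file and return the copy on a MemoryStream or FileStream. From that stream you can then open a WordprocessingDocument (Word), SpreadsheetDocument (Excel), or PresentationDocument (PowerPoint) and make whatever changes you need, using the Open XML SDK and, optionally, the Open-XML-PowerTools.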

    using System.IO;

    namespace CodeSnippets.IO
    {
        /// <summary>
        /// This class demonstrates multiple ways to clone files stored in the file system.
        /// In all cases, the source file is stored in the file system. Where the return type
        /// is a <see cref="MemoryStream"/>, the destination file will be stored only on that
        /// <see cref="MemoryStream"/>. Where the return type is a <see cref="FileStream"/>,
        /// the destination file will be stored in the file system and opened on that
        /// <see cref="FileStream"/>.
        /// </summary>
        /// <remarks>
        /// The contents of the <see cref="MemoryStream"/> instances returned by the sample
        /// methods can be written to a file as follows:
        ///
        ///     var stream = ReadAllBytesToMemoryStream(sourcePath);
        ///     File.WriteAllBytes(destPath, stream.GetBuffer());
        ///
        /// You can use <see cref="MemoryStream.GetBuffer"/> in cases where the MemoryStream
        /// was created using <see cref="MemoryStream()"/> or <see cref="MemoryStream(int)"/>.
        /// Note that GetBuffer() returns the whole internal buffer, including any unused
        /// bytes beyond <see cref="MemoryStream.Length"/>, so it is only safe while the
        /// length equals the capacity. In other cases, use <see cref="MemoryStream.ToArray"/>,
        /// which copies only the written bytes to a new array (GetBuffer() avoids that copy).
        /// </remarks>
        public static class FileCloner
        {
            public static MemoryStream ReadAllBytesToMemoryStream(string path)
            {
                byte[] buffer = File.ReadAllBytes(path);
                var destStream = new MemoryStream(buffer.Length);
                destStream.Write(buffer, 0, buffer.Length);
                destStream.Seek(0, SeekOrigin.Begin);
                return destStream;
            }

            public static MemoryStream CopyFileStreamToMemoryStream(string path)
            {
                using FileStream sourceStream = File.OpenRead(path);
                var destStream = new MemoryStream((int) sourceStream.Length);
                sourceStream.CopyTo(destStream);
                destStream.Seek(0, SeekOrigin.Begin);
                return destStream;
            }

            public static FileStream CopyFileStreamToFileStream(string sourcePath, string destPath)
            {
                using FileStream sourceStream = File.OpenRead(sourcePath);
                FileStream destStream = File.Create(destPath);
                sourceStream.CopyTo(destStream);
                destStream.Seek(0, SeekOrigin.Begin);
                return destStream;
            }

            public static FileStream CopyFileAndOpenFileStream(string sourcePath, string destPath)
            {
                File.Copy(sourcePath, destPath, true);
                return new FileStream(destPath, FileMode.Open, FileAccess.ReadWrite, FileShare.None);
            }
        }
    }
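The remarks on the class mention both GetBuffer() and ToArray(). As a minimal sketch of how the two differ, assuming an expandable MemoryStream whose capacity is chosen by the runtime, the following compares the sizes they return. The class and method names (MemoryStreamBufferDemo, Measure) are hypothetical and not part of the original sample.

```csharp
using System;
using System.IO;

public static class MemoryStreamBufferDemo
{
    // Hypothetical helper: writes byteCount bytes to an expandable MemoryStream
    // and reports the size of GetBuffer(), the size of ToArray(), and Length.
    public static (int BufferLength, int ArrayLength, long StreamLength) Measure(int byteCount)
    {
        using var stream = new MemoryStream();
        stream.Write(new byte[byteCount], 0, byteCount);

        // GetBuffer() exposes the internal buffer, which may be larger than
        // the number of bytes written. ToArray() copies only the written bytes.
        return (stream.GetBuffer().Length, stream.ToArray().Length, stream.Length);
    }

    public static void Main()
    {
        var (bufferLength, arrayLength, streamLength) = Measure(3);
        Console.WriteLine(
            $"Length={streamLength}, ToArray={arrayLength}, GetBuffer={bufferLength}");
    }
}
```

If the internal buffer is larger than Length, writing GetBuffer() to a file appends unused bytes after the actual content, so ToArray() is the safer choice whenever the stream may have grown.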

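    In addition to the approaches above, which are agnostic of Open XML, you can also use the following approach, for example if you have already opened an OpenXmlPackage such as a WordprocessingDocument, SpreadsheetDocument, or PresentationDocument: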

    public void DoWorkCloningOpenXmlPackage()
    {
        using WordprocessingDocument sourceWordDocument = WordprocessingDocument.Open(SourcePath, false);

        // There are multiple overloads of the Clone() method in the Open XML SDK.
        // This one clones the source document to the given destination path and
        // opens it in read-write mode.
        using var wordDocument = (WordprocessingDocument) sourceWordDocument.Clone(DestPath, true);

        ChangeWordprocessingDocument(wordDocument);
    }

    All of the above methods correctly clone or copy a document. But which one is the most efficient?

    Enter our benchmark, which uses the BenchmarkDotNet NuGet package:

    using System;
    using System.Collections.Generic;
    using System.Diagnostics.CodeAnalysis;
    using System.IO;
    using System.Linq;
    using BenchmarkDotNet.Attributes;
    using CodeSnippets.IO;
    using CodeSnippets.OpenXml.Wordprocessing;
    using DocumentFormat.OpenXml.Packaging;
    using DocumentFormat.OpenXml.Wordprocessing;

    namespace CodeSnippets.Benchmarks.IO
    {
        public class FileClonerBenchmark
        {
            #region Setup and Helpers

            private const string SourcePath = "Source.docx";
            private const string DestPath = "Destination.docx";

            [Params(1, 10, 100, 1000)]
            public static int ParagraphCount;

            [GlobalSetup]
            public void GlobalSetup()
            {
                CreateTestDocument(SourcePath);
                CreateTestDocument(DestPath);
            }

            private static void CreateTestDocument(string path)
            {
                const string sentence = "The quick brown fox jumps over the lazy dog.";
                string text = string.Join("", Enumerable.Range(0, 22).Select(i => sentence));
                IEnumerable<string> texts = Enumerable.Range(0, ParagraphCount).Select(i => text);
                using WordprocessingDocument unused = WordprocessingDocumentFactory.Create(path, texts);
            }

            private static void ChangeWordprocessingDocument(WordprocessingDocument wordDocument)
            {
                Body body = wordDocument.MainDocumentPart.Document.Body;
                Text text = body.Descendants<Text>().First();
                text.Text = DateTimeOffset.UtcNow.Ticks.ToString();
            }

            #endregion

            #region Benchmarks

            [Benchmark(Baseline = true)]
            public void DoWorkUsingReadAllBytesToMemoryStream()
            {
                using MemoryStream destStream = FileCloner.ReadAllBytesToMemoryStream(SourcePath);

                using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(destStream, true))
                {
                    ChangeWordprocessingDocument(wordDocument);
                }

                // GetBuffer() returns the whole internal buffer, which may exceed destStream.Length.
                File.WriteAllBytes(DestPath, destStream.GetBuffer());
            }

            [Benchmark]
            public void DoWorkUsingCopyFileStreamToMemoryStream()
            {
                using MemoryStream destStream = FileCloner.CopyFileStreamToMemoryStream(SourcePath);

                using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(destStream, true))
                {
                    ChangeWordprocessingDocument(wordDocument);
                }

                // GetBuffer() returns the whole internal buffer, which may exceed destStream.Length.
                File.WriteAllBytes(DestPath, destStream.GetBuffer());
            }

            [Benchmark]
            public void DoWorkUsingCopyFileStreamToFileStream()
            {
                using FileStream destStream = FileCloner.CopyFileStreamToFileStream(SourcePath, DestPath);
                using WordprocessingDocument wordDocument = WordprocessingDocument.Open(destStream, true);
                ChangeWordprocessingDocument(wordDocument);
            }

            [Benchmark]
            public void DoWorkUsingCopyFileAndOpenFileStream()
            {
                using FileStream destStream = FileCloner.CopyFileAndOpenFileStream(SourcePath, DestPath);
                using WordprocessingDocument wordDocument = WordprocessingDocument.Open(destStream, true);
                ChangeWordprocessingDocument(wordDocument);
            }

            [Benchmark]
            public void DoWorkCloningOpenXmlPackage()
            {
                using WordprocessingDocument sourceWordDocument = WordprocessingDocument.Open(SourcePath, false);
                using var wordDocument = (WordprocessingDocument) sourceWordDocument.Clone(DestPath, true);
                ChangeWordprocessingDocument(wordDocument);
            }

            #endregion
        }
    }

    The above benchmark is run as follows:

    using BenchmarkDotNet.Running;
    using CodeSnippets.Benchmarks.IO;

    namespace CodeSnippets.Benchmarks
    {
        public static class Program
        {
            public static void Main()
            {
                BenchmarkRunner.Run<FileClonerBenchmark>();
            }
        }
    }

    What are the results on my machine? Which method is the fastest?

    BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18362
    Intel Core i7-7500U CPU 2.70GHz (Kaby Lake), 1 CPU, 4 logical and 2 physical cores
    .NET Core SDK=3.0.100
      [Host]     : .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), X64 RyuJIT
      DefaultJob : .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), X64 RyuJIT
    | Method                                  | ParaCount |      Mean |     Error |    StdDev |    Median | Ratio |
    | --------------------------------------- | --------- | --------: | --------: | --------: | --------: | ----: |
    | DoWorkUsingReadAllBytesToMemoryStream   | 1         |  1.548 ms | 0.0298 ms | 0.0279 ms |  1.540 ms |  1.00 |
    | DoWorkUsingCopyFileStreamToMemoryStream | 1         |  1.561 ms | 0.0305 ms | 0.0271 ms |  1.556 ms |  1.01 |
    | DoWorkUsingCopyFileStreamToFileStream   | 1         |  2.394 ms | 0.0601 ms | 0.1100 ms |  2.354 ms |  1.55 |
    | DoWorkUsingCopyFileAndOpenFileStream    | 1         |  3.302 ms | 0.0657 ms | 0.0855 ms |  3.312 ms |  2.12 |
    | DoWorkCloningOpenXmlPackage             | 1         |  4.567 ms | 0.1218 ms | 0.3591 ms |  4.557 ms |  3.13 |
    |                                         |           |           |           |           |           |       |
    | DoWorkUsingReadAllBytesToMemoryStream   | 10        |  1.737 ms | 0.0337 ms | 0.0361 ms |  1.742 ms |  1.00 |
    | DoWorkUsingCopyFileStreamToMemoryStream | 10        |  1.752 ms | 0.0347 ms | 0.0571 ms |  1.739 ms |  1.01 |
    | DoWorkUsingCopyFileStreamToFileStream   | 10        |  2.505 ms | 0.0390 ms | 0.0326 ms |  2.500 ms |  1.44 |
    | DoWorkUsingCopyFileAndOpenFileStream    | 10        |  3.532 ms | 0.0731 ms | 0.1860 ms |  3.455 ms |  2.05 |
    | DoWorkCloningOpenXmlPackage             | 10        |  4.446 ms | 0.0880 ms | 0.1470 ms |  4.424 ms |  2.56 |
    |                                         |           |           |           |           |           |       |
    | DoWorkUsingReadAllBytesToMemoryStream   | 100       |  2.847 ms | 0.0563 ms | 0.0553 ms |  2.857 ms |  1.00 |
    | DoWorkUsingCopyFileStreamToMemoryStream | 100       |  2.865 ms | 0.0561 ms | 0.0786 ms |  2.868 ms |  1.02 |
    | DoWorkUsingCopyFileStreamToFileStream   | 100       |  3.550 ms | 0.0697 ms | 0.0881 ms |  3.570 ms |  1.25 |
    | DoWorkUsingCopyFileAndOpenFileStream    | 100       |  4.456 ms | 0.0877 ms | 0.0861 ms |  4.458 ms |  1.57 |
    | DoWorkCloningOpenXmlPackage             | 100       |  5.958 ms | 0.1242 ms | 0.2727 ms |  5.908 ms |  2.10 |
    |                                         |           |           |           |           |           |       |
    | DoWorkUsingReadAllBytesToMemoryStream   | 1000      | 12.378 ms | 0.2453 ms | 0.2519 ms | 12.442 ms |  1.00 |
    | DoWorkUsingCopyFileStreamToMemoryStream | 1000      | 12.538 ms | 0.2070 ms | 0.1835 ms | 12.559 ms |  1.02 |
    | DoWorkUsingCopyFileStreamToFileStream   | 1000      | 12.919 ms | 0.2457 ms | 0.2298 ms | 12.939 ms |  1.05 |
    | DoWorkUsingCopyFileAndOpenFileStream    | 1000      | 13.728 ms | 0.2803 ms | 0.5196 ms | 13.652 ms |  1.11 |
    | DoWorkCloningOpenXmlPackage             | 1000      | 16.868 ms | 0.2174 ms | 0.1927 ms | 16.801 ms |  1.37 |

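    It turns out that DoWorkUsingReadAllBytesToMemoryStream() is consistently the fastest method. However, its lead over DoWorkUsingCopyFileStreamToMemoryStream() is easily within the margin of error. This means that you should open your Open XML documents on a MemoryStream for processing wherever possible. And if you don't have to store the resulting document in the file system, this will be even faster than needlessly using a FileStream.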
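If the in-memory result does have to be stored in the file system, one option is MemoryStream.WriteTo(), which writes only the bytes up to Length and does not create an intermediate array. The following is a minimal sketch of that idea, not part of the benchmark above. The class name, the SaveTo helper, and the temporary file name are hypothetical.

```csharp
using System;
using System.IO;

public static class MemoryStreamPersistence
{
    // Hypothetical helper: persists the full contents of the MemoryStream
    // (exactly stream.Length bytes, regardless of its current position).
    public static void SaveTo(MemoryStream stream, string destPath)
    {
        using FileStream fileStream = File.Create(destPath);
        stream.WriteTo(fileStream);
    }

    public static void Main()
    {
        string destPath = Path.Combine(Path.GetTempPath(), "MemoryStreamPersistence.bin");

        using var stream = new MemoryStream();
        stream.Write(new byte[] { 1, 2, 3 }, 0, 3);

        SaveTo(stream, destPath);

        // The file should contain exactly as many bytes as were written to the stream.
        Console.WriteLine(new FileInfo(destPath).Length == stream.Length);
        File.Delete(destPath);
    }
}
```

With an Open XML document, you would call such a helper only after disposing the WordprocessingDocument (or other package) opened on the stream, so that all changes have been flushed to the MemoryStream.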

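    Wherever an output FileStream is involved, you see a more "significant" difference (bearing in mind that a millisecond can make a difference if you process large numbers of documents). Also note that using File.Copy() is actually not such a good approach.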

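    Finally, using the OpenXmlPackage.Clone() method or one of its overloads turns out to be the slowest approach. This is because it involves more elaborate logic than just copying bytes. However, if all you have is a reference to an OpenXmlPackage (or, in practice, one of its subclasses), the Clone() method and its overloads are your best choice.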

    You can find the full source code in my CodeSnippets GitHub repository. Look at the CodeSnippets.Benchmark project and the FileCloner class.