[alibaba/easyexcel]读取excel失败--无效字符

2024-05-11 447 views
1

version: 3.0.5

读取某Excel 报了如下的错

Caused by: org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0xffff) was found in the element content of the document.
    at java.xml/com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1243)
    at java.xml/com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:635)
    at com.alibaba.excel.analysis.v07.XlsxSaxAnalyser.parseXmlSource(XlsxSaxAnalyser.java:197)

看了下解压缩后的xml(xl/sharedStrings.xml) 其中包含字符 <ffff>

但用WPS是可以打开这个excel的 参考https://stackoverflow.com/questions/5742543/an-invalid-xml-character-unicode-0xc-was-found 尝试了如下的方案 均未能解决

尝试一:

byte[] bytes = Files.readAllBytes(path);
String str = new String(bytes);
String processedStr = stripNonValidXMLCharacters(str);
final InputStream inputStream = new ByteArrayInputStream(processedStr.getBytes());
EasyExcel.read(inputStream, ExcelData.class, readListener).excelType(ExcelTypeEnum.XLSX).sheet().doRead();

public String stripNonValidXMLCharacters(String in) {
    StringBuilder out = new StringBuilder(); // Used to hold the output.
    char current; // Used to reference the current character.

    if (in == null || ("".equals(in))) return ""; // vacancy test.
    for (int i = 0; i < in.length(); i++) {
        current = in.charAt(i); // NOTE: No IndexOutOfBoundsException caught here; it should not happen.
        if ((current == 0x9) ||
                (current == 0xA) ||
                (current == 0xD) ||
                ((current >= 0x20) && (current <= 0xD7FF)) ||
                ((current >= 0xE000) && (current <= 0xFFFD)) ||
                ((current >= 0x10000) && (current <= 0x10FFFF)))
            out.append(current);
    }
    return out.toString();
}

报错:

Caused by: java.util.zip.ZipException: Unexpected record signature: 0XEF524B50
    at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.getNextZipEntry(ZipArchiveInputStream.java:260)
    at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.getNextEntry(ZipArchiveThresholdInputStream.java:141)

尝试二:

byte[] bytes = Files.readAllBytes(path);
String str = new String(bytes);
Pattern p = Pattern.compile("[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\u10000-\\u10FFF]+");
String processedStr = p.matcher(str).replaceAll("");
final InputStream inputStream = new ByteArrayInputStream(processedStr.getBytes());
EasyExcel.read(inputStream, ExcelData.class, readListener).excelType(ExcelTypeEnum.XLSX).sheet().doRead();

报错

Caused by: java.util.zip.ZipException: Unexpected record signature: 0XEF524B50
    at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.getNextZipEntry(ZipArchiveInputStream.java:260)
    at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.getNextEntry(ZipArchiveThresholdInputStream.java:141)

故想问下 这个问题 如何解决?

回答

9

如果用poi读取的话 可以通过如下的方案来解决

final OPCPackage opcPackage = OPCPackage.open(new FileInputStream(fileLocation));

final List<PackagePart> partList = opcPackage.getPartsByName(Pattern.compile("(/xl/sharedStrings.xml)|(/xl/worksheets/.+\\.xml)"));
for (PackagePart packagePart : partList) {

   String sharedStrings = "";
   try (BufferedInputStream inputStream = new BufferedInputStream(packagePart.getInputStream());
        ByteArrayOutputStream sharedStringsBytes = new ByteArrayOutputStream() ) {
       byte[] buffer = new byte[1024];
       int length;
       while ((length = inputStream.read(buffer)) != -1) {
           sharedStringsBytes.write(buffer, 0, length);
       }
       sharedStrings = sharedStringsBytes.toString("UTF-8");
   } catch (Exception ex) {
       ex.printStackTrace();
   }
   String processedString = removeInvalidXmlCharacters(sharedStrings);

   try (BufferedOutputStream outputStream = new BufferedOutputStream(packagePart.getOutputStream()) ) {
       outputStream.write(processedString.getBytes("UTF-8"));
   } catch (Exception ex) {
       ex.printStackTrace();
   }
}

XSSFWorkbook workbook = new XSSFWorkbook(opcPackage);

https://stackoverflow.com/questions/70859613/when-read-excel-how-to-skip-some-invalid-characters/70894393#70894393

但 easyexcel怎么读取修改后的opcPackage呢?

1

这个可以提供一个excel 么 我想试下结果

4

这个问题是否已经解决?如果未解决 请重新打开问题。