[Performance] Generalize reading of CSV files with unknown structure #2305

Open
bigsinger opened this issue Dec 16, 2024 · 3 comments
bigsinger commented Dec 16, 2024

Method 1:
The conventional way to read:

while (csv.Read()) {
    int columnIndex = 0;
    DataRow row = dt.NewRow();

    foreach (DataColumn column in dt.Columns) {
        row[columnIndex] = csv.GetField(column.DataType, columnIndex);
        columnIndex++;
    }

    dt.Rows.Add(row);
}

Method 2:
Using a ClassMap:

public class FooMap : ClassMap<Foo>
{
    public FooMap()
    {
        Map(m => m.Id).Name("id");
        Map(m => m.Name).Name("name");
    }
}
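  1. Method 1 can read a CSV of any structure, which makes it easy to define the fields in a configuration file and do further analysis. However, it reads very slowly.
  2. Method 2 reads about 10 times faster than Method 1, but it is inflexible: the CSV's fields must be known in advance, so it does not generalize.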

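Request:
Is it possible to combine the strengths of Method 1 and Method 2 and get both speed and generality, even if the CSV fields have to be set through a configuration file? For example, in JSON: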

{
  "name": "搜索content",
  "find": [
    {
      "colName": "sendid",
      "findType": "Equals",
      "keywordsFile": ".\\ips.txt",
      "keywords": [
        [ "2xxxx076", "dxxxx134" ]
      ]
    },
    {
      "colName": "content",
      "findType": "Contains",
      "keywords": [ [ "微信", "支付", "转账" ] ]
    }
  ],

  "show": [ "sendtime", "sendid", "content" ],
  "sort": {
    "colName": "sendid",
    "asc": true
  }
}
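
A configuration like the one above could be loaded with `System.Text.Json`. The following is only a sketch: `SearchConfig`, `FindRule`, `SortRule`, and `RequiredColumns` are hypothetical names that mirror the sample JSON and are not part of CsvHelper.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical types mirroring the sample JSON; not part of CsvHelper.
public sealed class FindRule {
    public string ColName { get; set; } = "";
    public string FindType { get; set; } = "";      // "Equals" or "Contains" in the sample
    public string? KeywordsFile { get; set; }       // only present on some rules
    public List<List<string>> Keywords { get; set; } = new();
}

public sealed class SortRule {
    public string ColName { get; set; } = "";
    public bool Asc { get; set; }
}

public sealed class SearchConfig {
    public string Name { get; set; } = "";
    public List<FindRule> Find { get; set; } = new();
    public List<string> Show { get; set; } = new();
    public SortRule? Sort { get; set; }

    public static SearchConfig Parse(string json) {
        // The JSON uses camelCase keys, so match property names case-insensitively.
        var options = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };
        return JsonSerializer.Deserialize<SearchConfig>(json, options)
               ?? throw new JsonException("The configuration is empty.");
    }

    // Every column the reader has to load: searched, shown, or sorted on.
    public List<string> RequiredColumns() {
        var columns = new List<string>();
        void Add(string name) { if (!columns.Contains(name)) columns.Add(name); }
        foreach (var rule in Find) Add(rule.ColName);
        foreach (var name in Show) Add(name);
        if (Sort != null) Add(Sort.ColName);
        return columns;
    }
}
```

`RequiredColumns()` gives the subset of column names that actually has to be read from the CSV, which is the list the rest of the pipeline needs.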

@AltruCoder

Are you looking for something like dynamic mapping of a class? https://stackoverflow.com/a/69921900/2355006

bigsinger commented Dec 17, 2024

@AltruCoder
Thanks for your reply. This is a great idea, and it solved my problem to some extent. My solution looks like this:

public class Foo {
    public static int MaxColumnCout = 10;
    public string[] cols = new string[MaxColumnCout];

    // Each cN property exposes cols[N] so that a ClassMap can bind a CSV column to it.
    public string c0 { get => cols[0]; set => cols[0] = value; }
    public string c1 { get => cols[1]; set => cols[1] = value; }
    public string c2 { get => cols[2]; set => cols[2] = value; }
    public string c3 { get => cols[3]; set => cols[3] = value; }
    public string c4 { get => cols[4]; set => cols[4] = value; }
    public string c5 { get => cols[5]; set => cols[5] = value; }
    public string c6 { get => cols[6]; set => cols[6] = value; }
    public string c7 { get => cols[7]; set => cols[7] = value; }
    public string c8 { get => cols[8]; set => cols[8] = value; }
    public string c9 { get => cols[9]; set => cols[9] = value; }
}

public class FooField {
    public string FieldName { get; set; }
    public int FieldIndex { get; set; }
}

public static class CsvHelperExtensions {
    public static void Map<T>(this ClassMap<T> classMap, IDictionary<string, string> csvMappings) {
        foreach (var mapping in csvMappings) {
            var property = typeof(T).GetProperty(mapping.Key);

            if (property == null) {
                throw new ArgumentException($"Class {typeof(T).Name} does not have a property named {mapping.Key}");
            }

            classMap.Map(typeof(T), property).Name(mapping.Value);
        }
    }
}
// Maps each CSV column name to the Foo property bound to it and that property's index.
Dictionary<string, FooField> mapColumnField = new();

var mapping = new Dictionary<string, string>();
mapColumnField.Clear();

// Read the header row first.
string[]? allColumns = CSVBaseTable.ReadHeadersOnly(files[0]);

// Read the columns to search or display from the JSON configuration.
List<string> subColumns = new();
// load data for subColumns...
// ...

// Check that the configuration matches this CSV file.
if (allColumns != null) {
    bool valid = subColumns.All(column => allColumns.Contains(column));
    if (!valid) {
        throw new Exception("The configuration names a column that is not in the CSV; please check it.");
    }

    int maxColumnCount = Math.Min(subColumns.Count, Foo.MaxColumnCout);
    for (int i = 0; i < maxColumnCount; i++) {
        string fieldName = $"c{i}";
        mapping.Add(fieldName, subColumns[i]);
        // FieldIndex is the slot in Foo.cols that holds this column.
        mapColumnField.Add(subColumns[i], new FooField() { FieldIndex = i, FieldName = fieldName });
    }
}

using (var r = new StreamReader(files[0]))
using (var csv = new CsvReader(r, CultureInfo.InvariantCulture)) {
    var fooMap = new DefaultClassMap<Foo>();

    fooMap.Map(mapping);

    csv.Context.RegisterClassMap(fooMap);

    var records = csv.GetRecords<Foo>().ToList();
    addedRows = records.Count;
}

I feel this can still be optimized. Do you have any suggestions?

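This solves my actual problem for now, but it requires explicitly declaring enough properties on Foo. Would it be possible to bind directly to a single array-typed property of Foo?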
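
One possible way to avoid declaring `c0` through `c9` is to skip class mapping and fill a `string[]` per row by hand, resolving each column name to its index once after the header is read. This is only a sketch: `SelectedColumnReader` is a hypothetical helper, and whether it matches the speed of `csv.GetRecords<Foo>()` has not been measured here.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using CsvHelper;

// Hypothetical helper; not part of CsvHelper.
public static class SelectedColumnReader {
    // Reads only the requested columns. Each row becomes a string[] whose
    // elements follow the order of `columns`.
    public static List<string[]> Read(TextReader reader, IReadOnlyList<string> columns) {
        using var csv = new CsvReader(reader, CultureInfo.InvariantCulture);
        csv.Read();
        csv.ReadHeader();

        // Resolve each column name to its position once, not once per row.
        var indexes = new int[columns.Count];
        for (int i = 0; i < columns.Count; i++) {
            indexes[i] = csv.GetFieldIndex(columns[i]);
        }

        var rows = new List<string[]>();
        while (csv.Read()) {
            var row = new string[indexes.Length];
            for (int i = 0; i < indexes.Length; i++) {
                // Fetch by index as a plain string, so no type conversion is involved.
                row[i] = csv.GetField(indexes[i]) ?? "";
            }
            rows.Add(row);
        }
        return rows;
    }
}
```

Because the number of columns comes from the list passed in, there is no fixed limit like `MaxColumnCout`, and columns that are not requested are never copied into the result.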

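My requirement is to read and process arbitrary CSV files and implement a fast search-and-match feature without caring about the concrete column structure. An external JSON configuration file selects a subset of the columns. We do not actually need to read every column into memory, which should also help speed.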
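
The matching step described here could be sketched as below. `RowMatcher` is a hypothetical name, and the meaning of the nested `keywords` array is an assumption: the issue does not say how the groups combine, so this sketch requires every inner group to be satisfied and treats a group as satisfied when any one of its keywords matches.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; not part of CsvHelper.
public static class RowMatcher {
    // Tests one cell against one keyword according to findType.
    static bool Matches(string cell, string keyword, string findType) {
        switch (findType) {
            case "Equals":
                return string.Equals(cell, keyword, StringComparison.Ordinal);
            case "Contains":
                return cell.IndexOf(keyword, StringComparison.Ordinal) >= 0;
            default:
                throw new ArgumentException($"Unknown findType: {findType}");
        }
    }

    // ASSUMPTION: all groups must be satisfied (AND), and a group is satisfied
    // when any of its keywords matches the cell (OR).
    public static bool MatchesRule(string cell, string findType,
                                   IEnumerable<IEnumerable<string>> keywordGroups) {
        return keywordGroups.All(group => group.Any(keyword => Matches(cell, keyword, findType)));
    }
}
```

A row would then be kept when every rule in `find` matches the cell of its `colName`, again assuming the rules combine with AND.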

bigsinger commented Dec 17, 2024

One more note: I would like something that works like a DataTable. At the moment, reading into a DataTable is far too slow compared with your CsvHelper.

Reading in the following way is also slow, and not as fast as csv.GetRecords<Foo>():

while (csv.Read()) {
    int columnIndex = 0;
    DataRow row = dt.NewRow();

    foreach (DataColumn column in dt.Columns) {
        row[columnIndex] = csv.GetField(column.DataType, columnIndex);
        columnIndex++;
    }

    dt.Rows.Add(row);
}
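
For comparison, CsvHelper also ships `CsvDataReader`, an `IDataReader` adapter that `DataTable.Load` can consume directly instead of the hand-written loop. The sketch below is untested for speed against the loop above, so treat it as something to benchmark rather than a known fix; `DataTableLoader` is a hypothetical wrapper name.

```csharp
using System.Data;
using System.Globalization;
using System.IO;
using CsvHelper;

// Hypothetical wrapper; CsvDataReader itself is provided by CsvHelper.
public static class DataTableLoader {
    // Loads a whole CSV into a DataTable through CsvHelper's IDataReader adapter.
    // Columns are created from the header row and hold string values.
    public static DataTable Load(TextReader reader) {
        using var csv = new CsvReader(reader, CultureInfo.InvariantCulture);
        using var dataReader = new CsvDataReader(csv);

        var table = new DataTable();
        table.Load(dataReader);
        return table;
    }
}
```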
