zh-cn.search-data.json — site search index
{"/about/":{"data":{"":"This Blog is still under construction.","thanks-to#Thanks to":" github Hugo Hextra Tailwind CSS Heroicons Nextra Next.js "},"title":"About"},"/blog/":{"data":{"here-is-blogs-1#here is blogs 1":"here is blogs 1"},"title":"Blog"},"/blog/%E5%9F%BA%E5%9B%A0%E7%BB%84%E6%B5%8B%E5%BA%8F%E6%95%B0%E6%8D%AE%E5%A4%84%E7%90%86/":{"data":{"测序数据处理#测序数据处理":"测序数据处理"},"title":"Genome seq data process"},"/blog/%E6%95%B0%E6%8D%AE%E5%BA%93%E7%BD%91%E7%AB%99%E7%9B%B8%E5%85%B3/":{"data":{"记录数据库以及个人网站搭建#记录数据库以及个人网站搭建":"记录数据库以及个人网站搭建"},"title":"Database website construction"},"/blog/%E6%95%B0%E6%8D%AE%E5%BA%93%E7%BD%91%E7%AB%99%E7%9B%B8%E5%85%B3/hugo_github_buildblog/":{"data":{"":"平台:macbook\n时间:2024.6\n参考:\nhugo教程:https://hugo.opendocs.io/getting-started/quick-start/\nhextra主题:https://imfing.github.io/hextra/docs/getting-started/\n其他大佬的分享:https://krislinzhao.github.io/docs/create-a-wesite-using-github-pages-and-hugo/\nhttps://www.gohugo.org/","hogu创建站点#hogu创建站点":"仿照krislinzhao分享的教程,创建2个 GitHub 仓库,1个源仓库:用于储存所有 Markdown 、图片等。勾选 Pricate,设置为私人仓库。\n1个GitHub Pages 仓库:储存由 Hugo 从Markdown 文件生成的 HTML 文件。命名为\u003cusername.github.io\u003e(严格规范),设置为公开仓库。\n这样就可以在本地博客源仓库中进行修改,之后 Hugo 生成的静态 HTML 文件部署到远端 GitHub Pages 仓库中。当然也可以都放在一个仓库。\nhugo new site BlogSource #由于选择的主题的要求,我再加一个参数 hugo new site BlogSource --format=yaml 下载主题,主题可以从官方查找\ngit init git submodule add https://github.com/imfing/hextra.git themes/hextra #在hugo.yaml中增加 theme: hextra 创建文档\nhugo new about.md 生成的md文件自动产生于 content目录下,文档格式为:\n+++ date = \"2015-10-25T08:36:54-07:00\" draft = true title = \"about\" +++ date是文件创建日期;draft: true代表这篇文章是一个草稿,Hugo 不会显示草稿,要在主页显示添加的文章,可以设置 draft: false 此时已经已经可以在本地进行预览:\nhugo server --buildDrafts --disableFastRender ","工具配置#工具配置":"hugo\nbrew install hugo 其他平台参考官方手册 git安装","文件架构#文件架构":"文件组织架构可以理解为目录的结构,需要结合使用的主题进行实践探索\nkrislinzhao分享的教程中提到可以把theme的exampleSite中文件直接复制过来,以减少许多问题,但对我所使用的主题hextra不起作用,有幸hextra也提供了很完整的教程,尤其是已经配置了github的工作流,可以直接复制它的项目进行一点内容的修改就可以直接用了。\nhextra主题支持的架构如下:\ncontent ├── _index.md // \u003c- / ├── docs │ ├── _index.md // \u003c- /docs/ │ ├── getting-started.md // \u003c- /docs/getting-started/ │ └── guide │ ├── _index.md // \u003c- /docs/guide/ │ └── organize-files.md // \u003c- /docs/guide/organize-files/ └── blog ├── _index.md // \u003c- /blog/ └── post-1.md // \u003c- /blog/post-1/ _index.md是索引,会自定根据你的文件夹结构进行侧边栏的层次设置","通过git同步#通过Git同步":"保证网站没有更改的时候可以强制推送到远程仓库的main,一定要保重远程没有更改的情况下\ngit push -f origin main 关于自动化同步的我也找到一个脚本如下(来源待补充)\n#!/bin/sh # 任一步骤执行失败都会终止整个部署过程 set -e printf \"\\033[0;32mDeploying updates to GitHub...\\033[0m\\n\" # 构建静态内容 #hugo # if using a theme, replace with `hugo -t \u003cYOURTHEME\u003e` hugo --theme=hextra --baseURL=\"http://username.github.io/\" # 切换到 Public 文件夹 cd public # 添加更改到 git git add . # 提交更改 msg=\"rebuilding site $(date)\" if [ -n \"$*\" ]; then msg=\"$*\" fi git commit -m \"$msg\" # 推送到远程仓库 git push origin main 如何远程有更改,就先把远程拉到本地(注意在publish文件夹执行),之后执行推送\ncd public git fetch origin git reset --hard origin/main cd .. 
If the remote does have changes, first pull the remote state to the local copy (note: run this inside the public folder), then deploy:

```sh
cd public
git fetch origin
git reset --hard origin/main
cd ..
./deploy.sh
```

That is as far as this goes for now.

Deploying to GitHub Pages

We now have a site in rough shape; for me the remaining step is "deployment", and the tutorial is very clear here too: create a .github/workflows/pages.yaml file in the public directory and fill it in as the tutorial describes. After that it is the usual git flow:

```sh
$ hugo
$ cd public   # switch to the local folder holding the site output (assumed to be named public)
$ # git init  # initialize a new Git repository in this folder; first time only
$ # git remote add origin https://github.com/username/username.github.io.git  # link the local repository to the remote; first time only
$ git add -A  # stage all changed files
$ git commit -m "first commit"   # commit with the message "first commit"
$ git push -u origin master      # push to the remote master branch and set "origin" as the default remote
```

Mysql — /blog/%E6%95%B0%E6%8D%AE%E5%BA%93%E7%BD%91%E7%AB%99%E7%9B%B8%E5%85%B3/mysql/
Notes on MySQL.

Variable error reporting — /blog/%E7%A8%80%E5%A5%87%E5%8F%A4%E6%80%AA%E6%8A%A5%E9%94%99/
A log of assorted errors and their fixes.

_index — /docs/
context: papers, genome, Novels

FLAML-1: introduction — /docs/biotools/flaml1/

FLAML is a lightweight Python library for efficient automation of machine learning and AI operations. It automates workflows built on large language models, machine learning models, etc., and optimizes their performance.

Key features

- FLAML can build next-generation GPT-X applications based on multi-agent conversation with minimal effort. It simplifies the orchestration, automation, and optimization of complex GPT-X workflows, maximizing the performance of GPT-X models and compensating for their weaknesses.
- For common machine learning tasks such as classification and regression, it quickly finds high-quality models for the user's data even with limited compute. It is easy to customize or extend.
- It supports fast, economical automatic tuning, and can handle large search spaces with heterogeneous evaluation cost and complex constraints/guidance/early stopping.

Background: AutoGen enables next-generation large language model (LLM) applications through a multi-agent conversation framework, a mode in which you describe in natural language what the program should do and it is executed automatically. I have not tried it yet.

```python
import os
import tempfile

from autogen import ConversableAgent

temp_dir = tempfile.gettempdir()

arithmetic_agent = ConversableAgent(
    name="arithmetic_agent",
    llm_config=False,
    human_input_mode="ALWAYS",  # this agent always asks for human input, to keep code execution safe
    code_execution_config={"use_docker": False, "work_dir": temp_dir},
)

code_writer_agent = ConversableAgent(
    name="code_writer_agent",
    system_message="You are a code writer. You write Python scripts inside Markdown code blocks.",
    llm_config={"config_list": [{"model": "gpt-4", "api_key": os.environ["OPENAI_API_KEY"]}]},
    human_input_mode="NEVER",
)

poetry_agent = ConversableAgent(
    name="poetry_agent",
    system_message="You are an AI poet.",
    llm_config={"config_list": [{"model": "gpt-4", "api_key": os.environ["OPENAI_API_KEY"]}]},
    human_input_mode="NEVER",
)
```

Spark: a general-purpose big-data compute platform.

Usage

Install:

```sh
pip install flaml
# running in a notebook requires the notebook extra (same for autogen)
pip install "flaml[notebook]"
# conda
conda install flaml -c conda-forge
```

Training: flaml.AutoML is the task-oriented AutoML class. It can be used as a scikit-learn-style estimator with the standard fit and predict functions; X_train and y_train can be passed as numpy arrays or pandas DataFrames. The pattern:

```python
# Prepare training data
# ...
import pickle

from flaml import AutoML
from sklearn.linear_model import LogisticRegression

automl = AutoML()
settings = {
    "time_budget": 500,  # total time budget in seconds
    "estimator_list": ["lgbm", "rf", "catboost", "xgboost", "extra_tree", "xgb_limitdepth"],
    "metric": "roc_auc",  # candidates: 'r2', 'rmse', 'mae', 'mse', 'accuracy', 'roc_auc', 'roc_auc_ovr', 'roc_auc_ovo', 'log_loss', 'mape', 'f1', 'ap', 'ndcg', 'micro_f1', 'macro_f1'
    "task": "classification",  # task type
    "log_file_name": "airlines_experiment.log",  # FLAML log file
    "seed": 7654321,  # random seed
    "eval_method": "cv",
    "n_splits": 5,
    # "ensemble": True,
    "ensemble": {
        "final_estimator": LogisticRegression(),
        "passthrough": False,  # True (default) or False: whether to pass the original features to the stacker
        "best_individual_model": True,
    },
    "n_jobs": 4,
}
automl.fit(X_train, y_train, **settings)

# Save the model
with open("automl.pkl", "wb") as f:
    pickle.dump(automl, f, pickle.HIGHEST_PROTOCOL)

# At prediction time
with open("automl.pkl", "rb") as f:
    automl = pickle.load(f)
pred = automl.predict(X_test)
```
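As a self-contained illustration of this fit/save/predict pattern, here is a minimal runnable sketch. The toy dataset, the one-estimator list, and the 10-second budget are assumptions for the demo (they require flaml and lightgbm to be installed), not part of the original notes.

```python
import pickle

from flaml import AutoML
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Toy binary-classification data standing in for the real training set
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

automl = AutoML()
automl.fit(
    X_train,
    y_train,
    task="classification",
    metric="roc_auc",
    time_budget=10,           # seconds; tiny budget just for the demo
    estimator_list=["lgbm"],  # restrict to one estimator to keep the run short
)

# Persist and reload exactly as in the notes above
with open("automl.pkl", "wb") as f:
    pickle.dump(automl, f, pickle.HIGHEST_PROTOCOL)
with open("automl.pkl", "rb") as f:
    automl = pickle.load(f)

print(automl.best_estimator, automl.best_loss)
print(automl.predict(X_test)[:5])
```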
Predefined parameters

Tasks (specified via task):
- 'classification': classification with tabular data.
- 'regression': regression with tabular data.
- 'ts_forecast': time series forecasting.
- 'ts_forecast_classification': time series forecasting for classification.
- 'ts_forecast_panel': time series forecasting for panel datasets (multiple time series).
- 'rank': learning to rank.
- 'seq-classification': sequence classification.
- 'seq-regression': sequence regression.
- 'summarization': text summarization.
- 'token-classification': token classification.
- 'multichoice-classification': multichoice classification.

Built-in metrics:
- 'accuracy': 1 - accuracy as the corresponding metric to minimize.
- 'log_loss': default metric for multiclass classification.
- 'r2': 1 - r2_score as the corresponding metric to minimize. Default metric for regression.
- 'rmse': root mean squared error.
- 'mse': mean squared error.
- 'mae': mean absolute error.
- 'mape': mean absolute percentage error.
- 'roc_auc': minimize 1 - roc_auc_score. Default metric for binary classification.
- 'roc_auc_ovr': minimize 1 - roc_auc_score with multi_class="ovr".
- 'roc_auc_ovo': minimize 1 - roc_auc_score with multi_class="ovo".
- 'roc_auc_weighted': minimize 1 - roc_auc_score with average="weighted".
- 'roc_auc_ovr_weighted': minimize 1 - roc_auc_score with multi_class="ovr" and average="weighted".
- 'roc_auc_ovo_weighted': minimize 1 - roc_auc_score with multi_class="ovo" and average="weighted".
- 'f1': minimize 1 - f1_score.
- 'micro_f1': minimize 1 - f1_score with average="micro".
- 'macro_f1': minimize 1 - f1_score with average="macro".
- 'ap': minimize 1 - average_precision_score.
- 'ndcg': minimize 1 - ndcg_score.
- 'ndcg@k': minimize 1 - ndcg_score@k, where k is an integer.

A custom loss can also be defined, e.g.:

```python
def custom_metric(
    X_val,
    y_val,
    estimator,
    labels,
    X_train,
    y_train,
    weight_val=None,
    weight_train=None,
    *args,
):
    from sklearn.metrics import log_loss
    import time

    start = time.time()
    y_pred = estimator.predict_proba(X_val)
    pred_time = (time.time() - start) / len(X_val)
    val_loss = log_loss(y_val, y_pred, labels=labels, sample_weight=weight_val)
    y_pred = estimator.predict_proba(X_train)
    train_loss = log_loss(y_train, y_pred, labels=labels, sample_weight=weight_train)
    alpha = 0.5
    return val_loss * (1 + alpha) - alpha * train_loss, {
        "val_loss": val_loss,
        "train_loss": train_loss,
        "pred_time": pred_time,
    }
```

This returns, as the metric to minimize, the validation loss penalized by the gap between validation and training loss (i.e. val_loss + alpha * (val_loss - train_loss)). The user can additionally place constraints on one or more of the metrics in the returned dictionary, like so:

```python
metric_constraints = [("train_loss", "<=", 0.1), ("val_loss", "<=", 0.1)]
automl.fit(
    X_train,
    y_train,
    max_iter=100,
    train_time_limit=1,
    metric_constraints=metric_constraints,
)
```
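A short sketch of wiring the two together: the constraints refer to the "train_loss"/"val_loss" keys that custom_metric returns, so the function itself is passed as metric in the same fit call (the data variables are assumed from the earlier example).

```python
from flaml import AutoML

automl = AutoML()
automl.fit(
    X_train,
    y_train,
    task="classification",
    metric=custom_metric,  # FLAML calls it with the signature defined above
    metric_constraints=[("train_loss", "<=", 0.1), ("val_loss", "<=", 0.1)],
    time_budget=60,
)
print(automl.best_loss)
```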
Estimator list: the list can contain one or more estimator names, each corresponding to a built-in or custom estimator. Each estimator has a search space of hyperparameter configurations. FLAML supports both classical machine learning models and deep neural networks. Custom models and search spaces can also be supplied (see the sketch after this section).

Built-in estimators:
- 'lgbm': LGBMEstimator for task "classification", "regression", "rank", "ts_forecast" and "ts_forecast_classification". Hyperparameters: n_estimators, num_leaves, min_child_samples, learning_rate, log_max_bin (logarithm of (max_bin + 1) with base 2), colsample_bytree, reg_alpha, reg_lambda.
- 'xgboost': XGBoostSkLearnEstimator for task "classification", "regression", "rank", "ts_forecast" and "ts_forecast_classification". Hyperparameters: n_estimators, max_leaves, min_child_weight, learning_rate, subsample, colsample_bylevel, colsample_bytree, reg_alpha, reg_lambda.
- 'xgb_limitdepth': XGBoostLimitDepthEstimator for task "classification", "regression", "rank", "ts_forecast" and "ts_forecast_classification". Hyperparameters: n_estimators, max_depth, min_child_weight, learning_rate, subsample, colsample_bylevel, colsample_bytree, reg_alpha, reg_lambda.
- 'rf': RandomForestEstimator for task "classification", "regression", "ts_forecast" and "ts_forecast_classification". Hyperparameters: n_estimators, max_features, max_leaves, criterion (for classification only). Starting from v1.1.0, it uses a fixed random_state by default.
- 'extra_tree': ExtraTreesEstimator for task "classification", "regression", "ts_forecast" and "ts_forecast_classification". Hyperparameters: n_estimators, max_features, max_leaves, criterion (for classification only). Starting from v1.1.0, it uses a fixed random_state by default.
- 'histgb': HistGradientBoostingEstimator for task "classification", "regression", "ts_forecast" and "ts_forecast_classification". Hyperparameters: n_estimators, max_leaves, min_samples_leaf, learning_rate, log_max_bin (logarithm of (max_bin + 1) with base 2), l2_regularization. It uses a fixed random_state by default.
- 'lrl1': LRL1Classifier (sklearn.LogisticRegression with L1 regularization) for task "classification". Hyperparameters: C.
- 'lrl2': LRL2Classifier (sklearn.LogisticRegression with L2 regularization) for task "classification". Hyperparameters: C.
- 'catboost': CatBoostEstimator for task "classification" and "regression". Hyperparameters: early_stopping_rounds, learning_rate, n_estimators.
- 'kneighbor': KNeighborsEstimator for task "classification" and "regression". Hyperparameters: n_neighbors.
- 'prophet': Prophet for task "ts_forecast". Hyperparameters: changepoint_prior_scale, seasonality_prior_scale, holidays_prior_scale, seasonality_mode.
- 'arima': ARIMA for task "ts_forecast". Hyperparameters: p, d, q.
- 'sarimax': SARIMAX for task "ts_forecast". Hyperparameters: p, d, q, P, D, Q, s.
- 'holt-winters': Holt-Winters (triple exponential smoothing) model for task "ts_forecast". Hyperparameters: seasonal_periods, seasonal, use_boxcox, trend, damped_trend.
- 'transformer': Hugging Face transformer models for task "seq-classification", "seq-regression", "multichoice-classification", "token-classification" and "summarization". Hyperparameters: learning_rate, num_train_epochs, per_device_train_batch_size, warmup_ratio, weight_decay, adam_epsilon, seed.
- 'temporal_fusion_transformer': TemporalFusionTransformerEstimator for task "ts_forecast_panel". Hyperparameters: gradient_clip_val, hidden_size, hidden_continuous_size, attention_head_size, dropout, learning_rate. There is a known issue with pytorch-forecasting logging.

Extra arguments can be defined per estimator:

```python
automl_settings = {
    "task": "classification",
    "time_budget": 10,
    "estimator_list": ["catboost", "rf"],
    "fit_kwargs_by_estimator": {
        "catboost": {
            "verbose": True,  # set catboost's verbosity to True
        }
    },
}
```

Bounding the training time. The time-related parameters are:
- "time_budget": the total time budget in seconds;
- "max_iter": 30 — caps the number of models tried in the AutoML run;
- "train_time_limit": 1 — training time per model in seconds;
- "pred_time_limit": 1e-3 — prediction time per instance in seconds.

If you want a time constraint, set time_budget. How large should it be? One option is to start with a short budget and check the log for a hint that it was too short:

WARNING - All estimator hyperparameters local search has converged at least once, and the total search time exceeds 10 times the time taken to find the best model.

Alternatively, set a generous budget but also pass early_stop=True, so the search stops automatically once it has converged. To get a feel for how much time is roughly needed, set max_iter=2 first; the log then reports an estimate:

INFO - Estimated sufficient time budget=145194s. Estimated necessary time budget=2118s.
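As mentioned above, a custom estimator can be registered alongside the built-ins. Here is a hedged sketch following FLAML's add_learner pattern; the class name MyRF and the single-parameter search space are illustrative assumptions, not from the original notes.

```python
from flaml import AutoML, tune
from flaml.automl.model import SKLearnEstimator
from sklearn.ensemble import RandomForestClassifier


class MyRF(SKLearnEstimator):
    """Hypothetical custom learner wrapping sklearn's RandomForestClassifier."""

    def __init__(self, task="classification", **config):
        super().__init__(task, **config)
        self.estimator_class = RandomForestClassifier

    @classmethod
    def search_space(cls, data_size, task):
        # One illustrative hyperparameter; a real search space would list several
        return {
            "n_estimators": {
                "domain": tune.lograndint(lower=4, upper=512),
                "init_value": 4,
            },
        }


automl = AutoML()
automl.add_learner(learner_name="myrf", learner_class=MyRF)
# automl.fit(X_train, y_train, task="classification", estimator_list=["myrf"], time_budget=60)
```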
Parallelism (default 1): n_jobs sets the number of threads per trial; n_concurrent_trials sets the number of concurrent trials and can be raised when multiple CPUs are available. Two ways to parallelize:

1. Parallel tuning with Ray:

```sh
pip install flaml[ray,blendsearch]
```

```python
import ray

ray.init(num_cpus=16)
automl.fit(X_train, y_train, n_jobs=4, n_concurrent_trials=4)
```

2. Parallel tuning with Spark (trial support; GPUs are not supported):

```sh
pip install "flaml[spark,blendsearch]>=1.1.0"
```

```python
automl.fit(X_train, y_train, n_concurrent_trials=4, use_spark=True)
```

Ensemble models: uses sklearn's stacking approach:

```python
from sklearn.linear_model import LogisticRegression

automl.fit(
    X_train,
    y_train,
    task="classification",
    ensemble={
        "final_estimator": LogisticRegression(),  # the final estimator
        "passthrough": False,  # whether to pass the original features to the stacker
    },
)
```

Resampling strategy: set eval_method to "holdout" or "cv" for holdout or cross-validation. For holdout you can optionally set split_ratio, the fraction of data held out for validation (default 0.1), or pass a separate validation set as X_val, y_val. For cv, n_splits sets the number of folds (default 5).

Data splitting: different task types use different default splits:
- stratified split for classification;
- uniform split for regression;
- time-based split for time series forecasting;
- group-based split for learning to rank.

Set split_type="uniform" to switch to a uniform split; when split_type is in ("uniform", "stratified") the data are shuffled. For classification and regression tasks, split_type="time" and split_type="group" can also be set.

Warm start (when you do not want to retrain from scratch every time):

```python
automl1 = AutoML()
automl1.fit(X_train, y_train, time_budget=3600)

automl2 = AutoML()
automl2.fit(
    X_train,
    y_train,
    time_budget=7200,
    starting_points=automl1.best_config_per_estimator,
)
```
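Returning to the holdout options above: eval_method="holdout" also accepts an explicit validation set via X_val and y_val. A minimal hedged sketch (the 10% split itself is an assumption for illustration):

```python
from flaml import AutoML
from sklearn.model_selection import train_test_split

# Carve an explicit validation set out of the training data
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.1, random_state=0
)

automl = AutoML()
automl.fit(
    X_tr,
    y_tr,
    X_val=X_val,
    y_val=y_val,
    eval_method="holdout",  # use the supplied validation set instead of cv
    task="classification",
    time_budget=60,
)
```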
Plotting accuracy against time

Increasing the time budget may improve accuracy further:

```python
from flaml.automl.data import get_output_from_log

(
    time_history,
    best_valid_loss_history,
    valid_loss_history,
    config_history,
    metric_history,
) = get_output_from_log(filename=settings["log_file_name"], time_budget=120)

import matplotlib.pyplot as plt
import numpy as np

plt.title("Learning Curve")
plt.xlabel("Wall Clock Time (s)")
plt.ylabel("Validation Accuracy")
plt.step(time_history, 1 - np.array(best_valid_loss_history), where="post")
plt.show()
```

Results

Take this call as the example:

```python
automl.fit(X_train, y_train, task="regression")

print(automl.model)
# <flaml.automl.model.LGBMEstimator object at 0x7f9b502c4550>

# best model
print(automl.best_estimator)
# lgbm

# best hyperparameters
print(automl.best_config)
# {'n_estimators': 148, 'num_leaves': 18, 'min_child_samples': 3, 'learning_rate': 0.17402065726724145, 'log_max_bin': 8, 'colsample_bytree': 0.6649148062238498, 'reg_alpha': 0.0009765625, 'reg_lambda': 0.0067613624509965}

# best hyperparameters per estimator
print(automl.best_config_per_estimator)
# {'lgbm': {'n_estimators': 148, 'num_leaves': 18, 'min_child_samples': 3, 'learning_rate': 0.17402065726724145, 'log_max_bin': 8, 'colsample_bytree': 0.6649148062238498, 'reg_alpha': 0.0009765625, 'reg_lambda': 0.0067613624509965}, 'rf': None, 'catboost': None, 'xgboost': {'n_estimators': 4, 'max_leaves': 4, 'min_child_weight': 1.8630223791106992, 'learning_rate': 1.0, 'subsample': 0.8513627344387318, 'colsample_bylevel': 1.0, 'colsample_bytree': 0.946138073111236, 'reg_alpha': 0.0018311776973217073, 'reg_lambda': 0.27901659190538414}, 'extra_tree': {'n_estimators': 4, 'max_features': 1.0, 'max_leaves': 4}}

# training time of the best configuration
print(automl.best_config_train_time)
# 0.24841618537902832

# number of iterations
print(automl.best_iteration)
# 10

# best loss
print(automl.best_loss)
# 0.15448622217577546

# time taken to find the best model
print(automl.time_to_find_best_model)
# 0.4167296886444092

print(automl.config_history)
# {0: ('lgbm', {'n_estimators': 4, 'num_leaves': 4, 'min_child_samples': 20, 'learning_rate': 0.09999999999999995, 'log_max_bin': 8, 'colsample_bytree': 1.0, 'reg_alpha': 0.0009765625, 'reg_lambda': 1.0}, 1.2300517559051514)}
# Meaning: at iteration 0 the config tried for lgbm is the dict above, and the
# wall-clock time when that trial finished is 1.23 s.
```

flaml.automl.model.LGBMEstimator is a wrapper class; the underlying estimator and its feature importances are available on it:

```python
print(automl.model.estimator)
"""
LGBMRegressor(colsample_bytree=0.7610534336273627,
              learning_rate=0.41929025492645006, max_bin=255,
              min_child_samples=4, n_estimators=45, num_leaves=4,
              reg_alpha=0.0009765625, reg_lambda=0.009280655005879943,
              verbose=-1)
"""

# plot feature importance
import matplotlib.pyplot as plt

plt.barh(
    automl.model.estimator.feature_name_,
    automl.model.estimator.feature_importances_,
)
```

test seq — /docs/genome_seq/
test seq

Novels — /docs/novels/
Book index.

The Count of Monte Cristo — /docs/novels/the_count_of_monte_cristo/
some here The Count of Monte Cristo

The Little Prince — /docs/novels/the_little_prince/
some here The Little Prince

paper share — /docs/paper_share/
here is guide