Open your mind, and open your eyes. (放眼未来, 自由想象)
Email: dirtysalt1987 AT gmail DOT com
GitHub: https://github.com/dirtysalt/
LinkedIn: https://www.linkedin.com/in/dirtysalt
Talks:
- Umeng Data Analytics Platform Architecture (友盟数据分析平台架构)
- China System Architect Conference (SACC 2014): How to Collect Data from 360 Million Mobile Devices in One Day (如何在一天之内收集3.6亿移动设备的数据)
Extensive experience in:
- large-scale distributed system design and implementation.
- network programming framework design and implementation.
- storage system design and implementation.
- performance optimization and tuning for systems and applications.
- big data processing and analysis, data mining and machine learning.
Specialties:
- proficient in C/C++, Python, Java.
- solid knowledge of data structures and algorithms.
- extremely familiar with system development on Linux.
Microsoft.com Bing IndexServe Team.
Works on the Quality Platform for Bing IndexServe.
Amazon.com Search Experience Team.
Develop UX Guardrails, which runs in the CI/CD pipelines to ensure that rendered web elements on the search page comply with internal UX design guidelines.
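As an illustration of the kind of check such a guardrail can run, the following sketch parses a rendered page and flags elements that violate simple rules; the markup, thresholds, and rule set are hypothetical, not Amazon's actual guidelines or tooling.

```python
# Hypothetical guardrail sketch: the rules and markup below are illustrative
# only, not Amazon's actual UX guidelines.
import re
from bs4 import BeautifulSoup

RENDERED_PAGE = """
<div class="result"><img src="thumb.png"><h3 style="font-size: 10px">Item title</h3></div>
"""

def check_page(html):
    violations = []
    soup = BeautifulSoup(html, "html.parser")
    # Rule 1 (hypothetical): every image needs alt text for accessibility.
    for img in soup.find_all("img"):
        if not img.get("alt"):
            violations.append(f"image without alt text: {img}")
    # Rule 2 (hypothetical): result titles must be rendered at >= 12px.
    for title in soup.select(".result h3"):
        m = re.search(r"font-size:\s*(\d+)px", title.get("style", ""))
        if m and int(m.group(1)) < 12:
            violations.append(f"title font too small: {title}")
    return violations

if __name__ == "__main__":
    # A CI/CD stage would fail the build when violations are found.
    for v in check_page(RENDERED_PAGE):
        print("UX violation:", v)
```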
Software Engineer, Head of Backend Development, CastBox.FM, 2016.4 - 2020.6
- crawler system. It crawls podcasts available on the internet and notifies users once new episodes are available; the software stack includes Python, Requests, BeautifulSoup, FeedParser, Squid, Redis, MongoDB, etc. By collecting RSS feeds submitted by users and collected from search queries, the number of podcasts in the database grew from 200K to 600K and the number of episodes from 20M to 40M. By applying a machine learning model to past episode release dates, we predict future release dates, so new episodes are fetched by the crawler within 5 minutes of being published and users are notified just in time (a minimal polling sketch appears after this list). Meanwhile, we optimize episode images by compression and cropping, shrinking them from MBs to under 300KB with little loss of quality, which saves network traffic and reduces image loading time in the mobile app.
- search system. Users can search for podcasts and episodes by keywords and get keyword suggestions. It is built on Elasticsearch and supports 12 languages including English, Portuguese, Spanish, German, Dutch, and CJK. Data shows that more than a third of user subscriptions come from search, so we put a lot of effort into improving and optimizing the search system in the following aspects.
- index freshness. Once an episode is fetched by the crawler, a message is put into a message queue (Redis) that triggers the indexing system. The latency of the whole pipeline is under 10 seconds, and more than 20K episodes are indexed per day.
- search latency. By using caches effectively and fine-tuning Elasticsearch, we keep the latency of the search API under 200ms and the suggestion API under 10ms.
- search relevance. Besides the document relevance score returned by Elasticsearch, we add signals such as play counts and subscription counts, both historical and from recent days, to compute a better relevance score.
- recommender system. By analyzing user subscription data and applying collaborative filtering, we recommend podcasts that users may like and surface similar podcasts. We use the LightFM Python library and apply the WARP algorithm to the last 3 months of subscription data. With parameter tuning and A/B testing, we raised the CTR of recommended podcasts from 2.16% to 4.52% and the CTR of similar podcasts from 1.90% to 3.19%.
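As a rough illustration of the crawler's polling step described above, here is a minimal sketch that checks one RSS feed with feedparser and reports episodes newer than the last seen publish time; the feed URL, state handling, and scheduling are simplified assumptions, not the production code.

```python
# Minimal feed-polling sketch; the URL and state handling are simplified
# assumptions rather than the production crawler.
import time
import feedparser

def check_feed(url, last_seen=0.0):
    """Return (title, enclosures) for entries published after last_seen."""
    feed = feedparser.parse(url)
    new_episodes = []
    for entry in feed.entries:
        published = getattr(entry, "published_parsed", None)
        if published and time.mktime(published) > last_seen:
            new_episodes.append((entry.title, entry.get("enclosures", [])))
    return new_episodes

if __name__ == "__main__":
    # Hypothetical feed URL; real feeds come from user submissions and search queries.
    for title, enclosures in check_feed("https://example.com/podcast.rss"):
        print("new episode:", title)
```

And a minimal sketch of the recommender training step with LightFM's WARP loss, assuming the subscription pairs have already been loaded; the interaction data and hyperparameters here are placeholders.

```python
# Minimal LightFM/WARP sketch; data and hyperparameters are placeholders,
# not the production values.
import numpy as np
from scipy.sparse import coo_matrix
from lightfm import LightFM

# Hypothetical (user_id, podcast_id) subscription pairs from the last 3 months.
subs = [(0, 1), (0, 3), (1, 2), (2, 3), (2, 1)]
rows, cols = zip(*subs)
interactions = coo_matrix((np.ones(len(subs)), (rows, cols)), shape=(3, 4))

model = LightFM(loss="warp", no_components=64)
model.fit(interactions, epochs=30, num_threads=4)

# Recommendations for user 0: score every podcast and take the highest ones.
scores = model.predict(0, np.arange(interactions.shape[1]))
print("recommended podcast ids:", np.argsort(-scores))

# "Similar podcasts" can come from cosine similarity over model.item_embeddings.
```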
Software Engineer, Logzilla, 2015.4 - 2015.8 (Remote, as Consultant)
A real-time event analytics platform.
- performance tuning to support ~200K eps (events per second).
- implement a new event storage engine to support ~1M eps.
Software Engineer, Galera, 2014.4 - 2014.11 (Remote, as Consultant)
A drop-in multi-master replication plugin for MySQL.
Optimized the cluster recovery process for the data center outage case, reducing recovery time from 30s to under 3s.
Software Architect, Data Platform, Umeng, 2012.6 - 2016.4
- design Umeng's internal realtime+batch architecture (a.k.a. the Lambda Architecture, http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html).
- kvproxy, an asynchronous high-performance HTTP server for easy access to various database systems such as HBase, MySQL, Riak, etc. It is written in Scala on Finagle, uses Google Protocol Buffers as the data exchange format, and uses a Google Guava LRU cache as the application-level cache. Since Finagle wraps asynchronous computation in the concept of a 'Future' and encourages developers to treat the server as a function (Your Server as a Function, http://monkey.org/~marius/funsrv.pdf), kvproxy can be used not only as a server but also as a library easily embedded into other applications.
- performance tuning of MapReduce jobs and Hadoop cluster usage from the perspectives of:
- application. use HBase bulk-loading instead of writing data to HBase directly for better throughput and stability.
- algorithm. use the HyperLogLog algorithm instead of exact sets to calculate cardinality, for better performance and the ability to query arbitrary time ranges (see the sketch after this list).
- system. turn off MapReduce speculative execution when reading data from HBase.
- language. use JNI instead of pure Java code to accelerate CPU computation.
- kernel. configure kernel parameters like /proc/sys/vm/zone_reclaim_mode and /sys/kernel/mm/redhat_transparent_hugepage/enabled.
- FastHBaseRest, an asynchronous high-performance HTTP server written in Netty that provides easy multi-language access to HBase via Google Protocol Buffers. Since HBase only provides an underlying block cache, FastHBaseRest implements an application-level item cache using Google Guava for better read performance. Compared to HBase's embedded HTTP server ('hbase rest'), access latency is 20% lower and transfer size is 40% smaller. It also has extra capabilities such as request rewriting.
- usched, an internal job scheduler that arranges interdependent jobs. It defines and implements a DSL called JDL (Job Description Language) to describe dependencies between jobs and job properties. It runs as an HTTP server and provides a web console to manage jobs, including submission and a running-status dashboard. Thousands of MapReduce jobs are scheduled by usched every day, with scheduling latency below 5 seconds.
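To make the HyperLogLog point above concrete, here is a toy Python implementation of the estimator; the production pipelines used a tuned implementation inside MapReduce jobs, so this is only a sketch of the idea. Daily sketches can be merged by element-wise max over registers, which is what makes arbitrary-time-range queries possible.

```python
# Toy HyperLogLog sketch, for illustration only (no small-range correction,
# not the implementation used in the production MapReduce jobs).
import hashlib

class HyperLogLog:
    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p                              # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)   # bias correction, m >= 128

    def add(self, item):
        h = int.from_bytes(hashlib.md5(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                     # first p bits choose a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1 # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Element-wise max merges sketches, e.g. daily sketches into any time range.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        return int(self.alpha * self.m * self.m /
                   sum(2.0 ** -r for r in self.registers))

hll = HyperLogLog()
for device_id in range(1_000_000):
    hll.add(f"device-{device_id}")
print(hll.count())   # roughly 1,000,000, within ~1% error
```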
Senior Software Engineer, Baidu, 2008.7 - 2012.6
- dstream, an in-house distributed real-time stream processing system in C++, similar to Twitter's Storm and Yahoo!'s S4. The alpha version, on a 10-node cluster, can process 1 million tuples per second while keeping latency under 100ms.
- comake2, an in-house build system in Python that draws on open-source build systems such as SCons, CMake, Google's GYP, and Boost's Jam. It has been widely used at Baidu for continuous integration.
- infpack, an in-house data exchange format in C++. Compared to Google's Protocol Buffers and Facebook's Thrift, serialization and deserialization are about 20~30% faster and the output is 10~20% smaller. The generated code is carefully hand-tuned, so the implementation is very efficient.
- ddbs (distributed database system), an in-house distributed relational database system. I mainly worked on the SQL parser, extending its syntax, and on implementing SPASS (single point automatic switch system) for fault tolerance.
- maintainer and developer of Baidu common libraries, including BSL (Baidu standard library), ullib (wrappers for socket I/O, file I/O, and some Linux syscalls), comdb (an embedded high-performance key-value storage system), memory allocators, character encoding, regular expressions, signature and hash algorithms, URL handling, an HTTP client, lock-free data structures and algorithms, etc.
- vitamin, an in-house tool that detects potential bugs in C/C++ source code via static analysis. It reports thousands of valuable warnings from scanning Baidu's entire code repository while keeping the false-positive rate relatively low.
- IDL compiler, an in-house compiler built with Flex and Bison that translates a DSL (domain-specific language) into code for exchanging data between C/C++ structs/classes and Mcpack (an in-house data format similar to Google's Protocol Buffers).
- itachi, a simple high-performance asynchronous network programming framework in C++. GitHub
- nasty, a simple lisp-syntax parser in C++ using Flex and Bison. GitHub
- brainfuck-llvm-jit, a simple JIT compiler of brainfuck using LLVM. GitHub
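For context on the last item, here is a plain reference interpreter for brainfuck in Python; it is not the LLVM JIT from the project above, only a sketch of the semantics that such a JIT compiles to native code.

```python
# Plain brainfuck interpreter (reference semantics only; the project above
# compiles the same programs to native code through LLVM instead).
import sys

def run(program, tape_size=30000):
    # Pre-compute matching bracket positions for O(1) jumps.
    jumps, stack = {}, []
    for pos, ch in enumerate(program):
        if ch == '[':
            stack.append(pos)
        elif ch == ']':
            start = stack.pop()
            jumps[start], jumps[pos] = pos, start

    tape, ptr, pc = [0] * tape_size, 0, 0
    while pc < len(program):
        ch = program[pc]
        if ch == '>':
            ptr += 1
        elif ch == '<':
            ptr -= 1
        elif ch == '+':
            tape[ptr] = (tape[ptr] + 1) % 256
        elif ch == '-':
            tape[ptr] = (tape[ptr] - 1) % 256
        elif ch == '.':
            sys.stdout.write(chr(tape[ptr]))
        elif ch == ',':
            data = sys.stdin.read(1)
            tape[ptr] = ord(data) if data else 0
        elif ch == '[' and tape[ptr] == 0:
            pc = jumps[pc]
        elif ch == ']' and tape[ptr] != 0:
            pc = jumps[pc]
        pc += 1

# The classic "Hello World!" program.
run("++++++++[>++++[>++>+++>+++>+<<<<-]>+>+>->>+[<]<-]"
    ">>.>---.+++++++..+++.>>.<-.<.+++.------.--------.>>+.>++.")
```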
- MS. Computer Science. Shandong University
- BE. Electronic Engineering. Shandong University
- Proficient in C++, Python, Java, and other languages
- Solid knowledge of data structures and algorithms
- Expert in the design and implementation of large-scale distributed systems
- Familiar with the design and implementation of network programming frameworks, storage systems, and database systems
Worked on the Quality Platform for Bing IndexServe.
Worked on UX Guardrails, a system that runs in CI/CD pipelines to ensure that web elements on the e-commerce search page comply with internal UX design guidelines.
Head of Backend Development, CastBox, 2016.4 - 2020.6
- Crawler system: crawls publicly available podcasts on the internet and pushes the latest episodes to users promptly. Built with Python, Requests, BeautifulSoup, and Squid; because podcast data does not fit a rigid schema, MongoDB is used as the storage system. By collecting RSS feeds submitted by users and gathered from user search terms, the number of podcasts in the catalog grew from 200K to 600K and the number of episodes from 20M to 40M, well ahead of competing products in coverage. Using machine learning to predict future episode release times from historical release times, the update-check delay for popular podcasts drops below 5 minutes, close to real time, so users receive push notifications for new episodes almost immediately. We also compress and crop podcast and episode artwork, shrinking MB-sized images to under 300KB, which saves client download traffic and reduces loading time (an image-processing sketch appears after this list).
- Search system: lets users search for podcasts and episodes on the platform by keyword, with query-suggestion support. Built on Elasticsearch, it supports 12 languages, including English, Portuguese, Spanish, German, and CJK. Backend data shows that more than a third of user subscriptions come from search, so we improved the search system along three axes: index freshness, search latency, and relevance ranking. For freshness, when the crawler detects that a podcast or episode has changed, it notifies the indexing system through a message queue; pipeline latency is under 10s and more than 20K documents are re-indexed per day on average. For latency, caching and Elasticsearch tuning keep keyword search under 200ms and suggestions under 10ms. For relevance, besides the document relevance score returned by Elasticsearch, we use features such as total subscription and play counts and 1-day and 7-day subscription and play counts, combined into an overall relevance score.
- Recommender system: analyzes user subscription data to recommend podcasts to users and to find similar podcasts. There are roughly 10M users and roughly 130K subscribed podcasts, with a matrix sparsity of about 0.86. Using episodes as items would make the matrix even sparser, weaken collaborative filtering, and cost far more to compute, so episode-level recommendation was not implemented. We trained LightFM's WARP algorithm on the last 3 months of subscription data, raising the CTR of similar-podcast recommendations from 2.16% to 4.52% and the CTR of per-user podcast recommendations from 1.90% to 3.19%.
- Other app development:
- Picasso: an Android app that uses neural networks for image style transfer, similar to Prisma.
- CashBox: a live quiz contest with prizes, built mainly with WebSocket (Socket.IO).
- Alexa Skill (CastBox): subscribe to and listen to podcasts on the CastBox platform through Alexa.
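A minimal sketch of the artwork optimization mentioned in the crawler item above, assuming Pillow and illustrative size targets (the actual pipeline and thresholds may differ): center-crop to a square, downscale, and re-encode as JPEG until the result fits under roughly 300KB.

```python
# Minimal artwork-optimization sketch with Pillow; size targets and quality
# steps are illustrative assumptions, not the production settings.
from io import BytesIO
from PIL import Image

def optimize_artwork(path, max_side=600, max_bytes=300 * 1024):
    img = Image.open(path).convert("RGB")
    # Center-crop to a square, then downscale.
    side = min(img.size)
    left, top = (img.width - side) // 2, (img.height - side) // 2
    img = img.crop((left, top, left + side, top + side))
    img = img.resize((max_side, max_side), Image.LANCZOS)

    # Lower JPEG quality until the encoded image fits the size budget.
    buf = BytesIO()
    for quality in (85, 75, 65, 55):
        buf = BytesIO()
        img.save(buf, format="JPEG", quality=quality, optimize=True)
        if buf.tell() <= max_bytes:
            break
    return buf.getvalue()
```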
Senior Software Architect, Umeng, 2012.6 - 2016.4
- China System Architect Conference (SACC 2014): How to Collect Data from 360 Million Mobile Devices in One Day
- Designed and implemented the realtime+batch (Lambda) architecture, using batch results to correct and complement realtime results. Batch computation takes the full dataset as input, so it produces more accurate results and tolerates failures well, but its latency is measured in hours; realtime computation has second-level latency but cannot do deeper analysis because it never sees the full dataset. Evolving towards the realtime+batch architecture gave Umeng Analytics an advantage in both latency and depth of analysis (a minimal merge sketch appears after this list).
- Optimized Hadoop cluster usage, based on analyzing the data stored and the jobs running on the cluster:
- Added an LZMA codec to elephant-bird; on cold data it saves more than 60% space compared to LZO.
- Optimized HBase usage:
- Avoid writing directly to HBase; use bulk-loading instead to improve throughput.
- Replaced some hash joins over HBase with sort-merge joins that take HBase scans as input.
- For tables whose rowkeys start with a date prefix, prepend a hash code to the rowkey to spread data evenly across regions.
- Use the HyperLogLog algorithm for deduplicated metrics such as unique devices, improving efficiency and making queries over arbitrary time ranges possible.
- Rewrote CPU-intensive computations using JNI (Java Native Interface).
- FastHBaseRest, an asynchronous high-performance service that gives multiple languages access to HBase. It uses HTTP as the transport protocol and protobuf as the data exchange format for multi-language access, and asynchbase underneath for asynchronous access to HBase to improve throughput. Because HBase has only a block cache and no item cache, an application-level LRU cache built with Guava inside the service effectively reduces access latency. The service is modular and easy to extend, and its request-rewriting capability shields clients from changes in the underlying HBase schema. Compared with 'hbase rest', latency is 20% lower and transfer size is 40% smaller.
- usched, a job scheduler developed after surveying existing schedulers such as Oozie and Azkaban and adapting to how jobs actually run inside Umeng. It defines a job description language (JDL) that specifies dependencies between jobs, start times, and trigger conditions, allowing fine-grained control over job execution. usched accepts job submission and control via HTTP, has a fairly complete web console for management, and includes built-in alerting and redirection of command output. The hundreds of Hadoop jobs Umeng runs every day are all scheduled by usched, with scheduling latency under 5s.
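A minimal sketch of the serving-side merge in the realtime+batch architecture described above, with hypothetical views and keys: the batch view is preferred wherever it has caught up, and the realtime view fills in the most recent hours.

```python
# Hypothetical lambda-architecture merge sketch; the views, keys, and numbers
# below are made up for illustration.
batch_view = {("launches", "2016-01-01T10"): 120_000}      # full recompute, hours behind
realtime_view = {("launches", "2016-01-01T10"): 118_500,   # incremental, seconds behind
                 ("launches", "2016-01-01T11"): 64_200}

def query(metric, hours, batch_horizon):
    """Prefer the batch value up to batch_horizon (more accurate); fall back to
    the realtime value for hours the batch layer has not reached yet."""
    total = 0
    for hour in hours:
        key = (metric, hour)
        if hour <= batch_horizon and key in batch_view:
            total += batch_view[key]
        else:
            total += realtime_view.get(key, 0)
    return total

print(query("launches", ["2016-01-01T10", "2016-01-01T11"],
            batch_horizon="2016-01-01T10"))   # 120000 + 64200
```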
- Logzilla, 2015.4 - 2015.8. Rewrote the original event data storage engine. Write throughput (events per second) improved from 500K to 3M on SSD and from 100K to 1.2M on HDD.
- Galera, 2014.4 - 2014.11. Improved the cluster recovery mechanism for the data-center power outage case, reducing cluster recovery time from 30s to under 3s.
- dstream, a distributed realtime stream processing system, built for applications that need to process streaming data in real time, where Hadoop's batch processing model cannot keep up. After surveying and comparing many existing distributed stream processing systems such as StreamBase and Storm, and taking Baidu's own requirements into account, dstream's processing model guarantees that data is not reordered, duplicated, or lost while maintaining high throughput and low latency. Many product lines, including realtime query anti-spam and realtime click anti-fraud for Baidu web search and the Baidu Union ad network, were building on dstream. The alpha version reaches 1M packets/s on a 10-node cluster with processing latency under 100ms.
- itachi, an asynchronous network programming framework, mainly built to handle slow client connections, or connections from the system to backends, while sustaining high throughput. After surveying and analyzing many open-source network programming frameworks and related projects such as hpserver, muduo, Boost.Asio, libev, and ZeroMQ, I found no reasonably complete high-performance asynchronous framework, so I implemented one, with the intention of building distributed components and systems on top of it. itachi's ping-pong benchmark saturates a gigabit NIC with CPU idle around 60%, and it easily handles C100K slow connections (a conceptual sketch follows this list).
- infpack, a data transfer/storage format. Based on studying existing implementations such as Google's protobuf and Facebook's Thrift, it separates the schema from the actual data in the wire format to reduce packet size and speed up packing and unpacking. The storage system behind Baidu's web page repository now uses infpack as its underlying transfer and storage format. infpack packets are 5-10% smaller than protobuf and pack/unpack 20-30% faster.
- DDBS (distributed database system): single-point automatic failover and the ESQL interpreter. DDBS has a master-slave architecture that shards single-node MySQL data across machines to improve read/write performance. The automatic failover system coordinates the slaves to elect a new master after the old master fails, while keeping data strongly consistent across nodes. Users write ESQL to tell DDBS how to shard their data. Baidu Fengchao (the search-ads platform) now runs almost entirely on DDBS.
- comake2, a build system for continuous integration. Developed after surveying and using many existing open-source build systems such as Google's GYP, CMake, and SCons, and highly customized for Baidu's internal development workflow. Nearly a hundred projects inside Baidu use comake2 for continuous integration. Because comake2 is written in Python and its mechanisms are transparent, different teams have contributed a dozen or so plugins, and overall the system now supports Baidu's continuous-integration needs well.
- Developed and maintained Baidu base libraries, including common data structures, lock-free B-trees, an HTTP client, URL handling, character-encoding detection and conversion, dictionaries, regular expressions, multi-pattern matching, signature algorithms, a memory allocator, data exchange formats, the IDL compiler, single-node storage systems, and network transport systems.
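itachi itself is C++; purely as a conceptual analogue of the event-driven model it is built around (many slow connections multiplexed without one thread per connection), here is a minimal asyncio echo server in Python.

```python
# Conceptual asyncio analogue only; itachi is a C++ framework and this code is
# not related to its implementation.
import asyncio

async def handle(reader, writer):
    # Echo each line back; a slow client ties up only this coroutine, not a thread.
    while data := await reader.readline():
        writer.write(data)
        await writer.drain()
    writer.close()
    await writer.wait_closed()

async def main():
    server = await asyncio.start_server(handle, "127.0.0.1", 8888)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```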