
    Basic Skills

    Python

    Python documentation

    JavaScript

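    JavaScript documentation

    The Modern JavaScript Tutorial

    Based on the latest JavaScript standard. Covers JavaScript from the basics to advanced topics with simple but sufficiently detailed explanations.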

    Java

    Java documentation

    C/C++

    C/C++ documentation

    Node.js

    Node.js documentation

    Go

    Go documentation


    Crawling Skills

    urllib

    URL handling modules

    urllib3

    urllib3 is a powerful, user-friendly HTTP client for Python

    httplib2

    A comprehensive HTTP client library.

    Requests

    HTTP for Humans

    aiohttp

    Asynchronous HTTP Client/Server for asyncio and Python.

    PySpider

    Official documentation for the PySpider crawler framework

    Scrapy

    Official documentation for the Scrapy crawler framework

    requests-html

    This library intends to make parsing HTML as simple and intuitive as possible.

    pyppeteer

    Unofficial Python port of puppeteer JavaScript (headless) chrome/chromium browser automation library.

    selenium

    Selenium is an umbrella project for a range of tools and libraries that enable and support the automation of web browsers.

    splash

    Splash is a JavaScript rendering service.

    js2py

    A JavaScript to Python translator and JavaScript interpreter. Everything is done in 100% pure Python, so it is easy to install and use.

    pyexecjs

    Run JavaScript code from Python.

    asyncio

    asyncio is a library for writing concurrent code using the async/await syntax.

    gevent

    gevent is a coroutine-based Python networking library that uses greenlet to provide a high-level synchronous API on top of the libev or libuv event loop.

    Tornado

    Tornado is a Python web framework and asynchronous networking library, originally developed at FriendFeed.

    Twisted

    Twisted is an event-driven networking engine written in Python


    Parsing Skills

    re

    Official Python documentation for regular expressions

    lxml

    The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt.

    BeautifulSoup4

    Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

    cssselect2

    cssselect2 is a straightforward implementation of CSS3 Selectors for markup documents (HTML, XML, etc.) that can be read by ElementTree-like parsers (including cElementTree, lxml, html5lib, etc.).

    html5lib

    html5lib is a pure-python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all major web browsers.

    pyquery

    pyquery allows you to make jQuery queries on XML documents. The API is as similar to jQuery as possible. pyquery uses lxml for fast XML and HTML manipulation.

    feedparser

    Universal Feed Parser is a Python module for downloading and parsing syndicated feeds.

    goose3

    goose3 is an HTML content and article extractor for Python 3.

    newspaper

    Article scraping & curation

    ocrmypdf

    OCRmyPDF adds an optical character recognition (OCR) text layer to scanned PDF files, allowing them to be searched.

    pdfminer.six

    Pdfminer.six is a python package for extracting information from PDF documents.

    pydub

    Manipulate audio with a simple and easy high level interface

    pyyaml

    PyYAML is a YAML parser and emitter for Python.

    readability

    Measure the readability of a given text using surface characteristics

    scrapely

    A pure-python HTML screen-scraping library

    untangle

    untangle is a tiny Python library which converts an XML document to a Python object.

    xml2dict

    Converts an XML file to a native Python dict object.


    Data Cleaning Skills

    NumPy

    Official Chinese documentation for NumPy, scientific computing

    Pandas

    Official Chinese documentation for Pandas, structured data analysis

    jieba

    Jieba Chinese text segmentation

    Matplotlib

    Official Chinese documentation for Matplotlib, a 2D plotting library

    gensim

    Gensim is a free Python library for topic modelling, document indexing, and similarity retrieval with large corpora.

    nameparser

    A simple Python module for parsing human names into their individual components.

    nltk

    NLTK is a leading platform for building Python programs to work with human language data.

    phonenumbers

    Python port of Google's libphonenumber

    PyNLPIR

    PyNLPIR is a Python wrapper around the NLPIR/ICTCLAS Chinese segmentation software.

    snownlp

    SnowNLP is a Python library that makes it easy to process Chinese text.

    thulac

    An Efficient Lexical Analyzer for Chinese

    xpinyin

    Translates Chinese hanzi to pinyin in Python, inspired by flyerhzm’s chinese_pinyin gem.


    Storage Skills

    MongoDB

    MongoDB API documentation

    pymongo

    PyMongo is a Python distribution containing tools for working with MongoDB, and is the recommended way to work with MongoDB from Python

    Redis

    Redis API documentation

    redis-py

    The Python interface to the Redis key-value store.

    MySQL

    MySQL documentation

    pymssql

    A simple database interface for Python that builds on top of FreeTDS to provide a Python DB-API (PEP-249) interface to Microsoft SQL Server.

    pymysql

    Python MySQL Client

    cx_Oracle

    cx_Oracle is a Python extension module that enables access to Oracle Database.

    elasticsearch

    Python Elasticsearch Client

    json

    JSON (JavaScript Object Notation), specified by RFC 7159 (which obsoletes RFC 4627) and by ECMA-404, is a lightweight data interchange format inspired by JavaScript object literal syntax.

    mistune

    A fast yet powerful Python Markdown parser with renderers and plugins, compatible with sane CommonMark rules.

    psycopg2

    Python adapter for PostgreSQL

    py2neo

    Py2neo is a client library and toolkit for working with Neo4j from within Python applications and from the command line.

    pyodbc

    Python ODBC bridge

    pypdf2

    A Pure-Python library built as a PDF toolkit.

    thrift

    The Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml and Delphi and other languages.

    xlrd

    This package is for reading data and formatting information from older Excel files

    xlwt

    xlwt is a library for writing data and formatting information to older Excel files (ie: .xls)


    Anti-Crawling Tools

    AST explorer

    AST explorer

    JavaScript AST visualizer

    JavaScript AST visualizer

    js code to svg flowchart

    js-code-to-svg-flowchart

    Alibaba Duguang (阿里读光)

    An online image OCR application from Alibaba

    Convert curl

    Convert curl syntax to Python, Ansible URI, MATLAB, Node.js, R, PHP, Strest, Go, Dart, JSON, Elixir, Rust

    Baidu Online Font Editor

    Baidu Online Font Editor

    奇Q Online Font Editor

    奇Q Online Font Editor

    httpbin

    A simple HTTP Request & Response Service.


    Acceleration Skills

    scrapy-redis

    Redis-based components for Scrapy.

    kafka

    Python client for the Apache Kafka distributed stream processing system. kafka-python is designed to function much like the official java client, with a sprinkling of pythonic interfaces (e.g., consumer iterators).

    celery

    Celery is a simple, flexible, and reliable distributed system to process vast amounts of messages, while providing operations with the tools required to maintain such a system.

    multiprocessing

    multiprocessing is a package that supports spawning processes using an API similar to the threading module.

    subprocess

    The subprocess module allows you to spawn new processes, connect to their input/output/error pipes, and obtain their return codes.

    threading

    This module constructs higher-level threading interfaces on top of the lower level _thread module. See also the queue module.

    fork

    Doing subprocess in Python should be easy

    huey

    A lightweight task queue for Python.

    rabbitmq

    RabbitMQ is an open-source message broker (also known as message-oriented middleware) that implements the Advanced Message Queuing Protocol (AMQP).

    rq (Redis Queue)

    RQ (Redis Queue) is a simple Python library for queueing jobs and processing them in the background with workers.


    Deployment Skills

    docker

    Learn how Docker helps developers bring their ideas to life by conquering the complexity of app development.

    kubernetes

    Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.

    openshift

    Red Hat OpenShift is an open source container application platform based on the Kubernetes container orchestrator for enterprise app development and deployment.

    scrapyd

    Scrapyd is an application for deploying and running Scrapy spiders. It enables you to deploy (upload) your projects and control their spiders using a JSON API.

    scrapyd-client

    Scrapyd-client is a client for scrapyd.

    python-scrapyd-api

    python-scrapyd-api is a very simple Python wrapper for working with Scrapyd's API; it allows a Python application to talk to, and therefore control, the Scrapy Daemon.

    scrapydweb

    A web application for Scrapyd cluster management, with support for Scrapy log analysis and visualization.

    crawlab

    A distributed crawler management platform: an enterprise-grade product built to make managing crawlers easy.


    Crawling Tools

    anyproxy

    AnyProxy is an open HTTP proxy server.

    Appium

    Mobile App Automation Made Awesome.

    Charles

    Charles is an HTTP proxy / HTTP monitor / Reverse Proxy that enables a developer to view all of the HTTP and SSL / HTTPS traffic between their machine and the Internet.

    Google Chrome

    The Google Chrome web browser

    Microsoft Edge

    The Microsoft Edge web browser

    Fiddler

    Fiddler is a free web debugging tool which logs all HTTP(S) traffic between your computer and the Internet. Inspect traffic, set breakpoints, and fiddle with incoming or outgoing data.

    mitmproxy

    mitmproxy is a free and open source interactive HTTPS proxy.

    wireshark

    Wireshark is a network packet analyzer. A network packet analyzer presents captured packet data in as much detail as possible.


    Browser Extensions

    EditThisCookie

    EditThisCookie is a cookie manager. You can add, delete, edit, search, protect and block cookies!

    Tampermonkey

    Tampermonkey is the most popular userscript manager, with over 10 million weekly users. It's available for Microsoft Edge, Chrome, Safari, Opera Next, and Firefox.

    ReRes

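    ReRes can be used to modify the response content of page requests. By specifying rules, you can map requests to other URLs, or to local files or directories. ReRes supports both single-URL mapping and directory mapping.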

    XPath Helper

    Extract, edit, and evaluate XPath queries with ease.

    Proxy SwitchyOmega

    Manage and switch between multiple proxy settings quickly and easily.

    JSON Formatter

    Makes JSON easy to read. Open source.


    © 2020 - 2021 Sitoi | Powered by Hexo | 沪ICP备18037784号-4