Querying more than 10,000 documents from Elasticsearch in Ruby - 白红宇

Published: 2019-05-26

Meaning of the response fields

A search response typically looks like the following (I have converted it to JSON):

{
  "took": 993,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "bank",
        "_type": "account",
        "_id": "1",
        "_score": 1.0,
        "_source": {
         ...

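The response contains several fields, whose meanings are as follows:

took is the time Elasticsearch spent executing the search, in milliseconds.

timed_out indicates whether the search timed out.

_shards reports how many shards were searched, with counts of the shards that succeeded and failed.

hits holds the actual search results.

hits.total is an object describing the total number of documents that match the search criteria.

hits.total.value is the total hit count (it must be interpreted together with hits.total.relation).

By default, hits.total.value is accurate only up to 10,000 hits. When hits.total.relation is eq, hits.total.value is the exact count. When hits.total.relation is gte, hits.total.value is a lower bound: more documents actually match.

hits.hits is the array that stores the actual search results (the first 10 documents by default).

hits.hits[].sort is the sort key of each hit (absent when results are sorted by score, which is the default when the request specifies no sort).

hits.hits[]._index is the name of the index the document belongs to.

hits.hits[]._type is the type of the document.

hits.hits[]._id is the ID of the document.

hits.hits[]._score indicates how well the document matches the query.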
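To make the eq / gte distinction concrete, here is a minimal sketch that reads hits.total from a response hash. The helper name describe_total is my own invention for illustration, and the two sample hashes only mimic the parts of the response shown in this article.

```ruby
# Hypothetical helper: turns hits.total into a readable statement.
# It expects a Hash with string keys, shaped like the responses in this article.
def describe_total(response)
  total = response.fetch('hits').fetch('total')
  case total['relation']
  when 'eq'  then "exactly #{total['value']} matching documents"
  when 'gte' then "at least #{total['value']} matching documents"
  else            "unknown relation: #{total['relation'].inspect}"
  end
end

# Sample data that mirrors the two totals seen in this article.
capped = { 'hits' => { 'total' => { 'value' => 10000, 'relation' => 'gte' } } }
exact  = { 'hits' => { 'total' => { 'value' => 13022, 'relation' => 'eq' } } }

puts describe_total(capped)
puts describe_total(exact)
```

The point is simply that value alone is not enough: code that reports a count should check relation before calling the number exact.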

Source data

The official sample data can be downloaded from GitHub. The structure of accounts.json looks like this:

{"index":{"_id":"1"}}
{"account_number":1,"balance":39225,"firstname":"Amber","lastname":"Duke","age":32,"gender":"M","address":"880 Holmes Lane","employer":"Pyrami","email":"amberduke@pyrami.com","city":"Brogan","state":"IL"}
{"index":{"_id":"6"}}
{"account_number":6,"balance":5686,"firstname":"Hattie","lastname":"Bond","age":36,"gender":"M","address":"671 Bristol Street","employer":"Netagy","email":"hattiebond@netagy.com","city":"Dante","state":"TN"}
{"index":{"_id":"13"}}
{"account_number":13,"balance":32838,"firstname":"Nanette","lastname":"Bates","age":28,"gender":"F","address":"789 Madison Street","employer":"Quility","email":"nanettebates@quility.com","city":"Nogal","state":"VA"}
{"index":{"_id":"18"}}
{"account_number":18,"balance":4180,"firstname":"Dale","lastname":"Adams","age":33,"gender":"M","address":"467 Hutchinson Court","employer":"Boink","email":"daleadams@boink.com","city":"Orick","state":"MD"}
...

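Generating data

This data set contains far fewer than 10,000 documents, so as it stands it cannot be used to test queries beyond the limit. We can write a script to expand accounts.json (it is rough, please bear with it). Here I simply create new records by changing the IDs of the original ones; you can use any approach you like, as long as there is enough data. The generated data is stored in test.json: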

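# Ruby
require 'json'

# Read the original bulk file: action lines and source lines alternate.
lines = File.readlines("accounts.json")

# Append 30 copies to test.json, multiplying every ID by the pass number.
# Note: the products can collide (2 * 3 == 6 * 1), and a colliding _id
# overwrites the earlier document instead of adding a new one.
File.open("test.json", "a+") do |file|
  (1..30).each do |i|
    lines.each do |raw|
      line = JSON.parse(raw)
      line["index"]["_id"] = (line["index"]["_id"].to_i * i).to_s if line.key?("index")
      line["account_number"] *= i if line.key?("account_number")
      file.puts line.to_json
    end
  end
end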

Indexing the data

Then bulk-import the generated data into Elasticsearch with the following command:

[looking@master ruby_learning]$ curl -XPOST 'localhost:9200/bank/account/_bulk?pretty' -H 'Content-Type: application/json' --data-binary "@test.json"
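The curl command above sends the whole file in one request. If the file grows, it may be more comfortable to send it in smaller pieces. The following is only a sketch of the grouping step in plain Ruby; bulk_batches and the batch size are hypothetical names and values of mine, and actually sending each batch (for example as the body of the client's bulk call) is left out because I have not verified it here.

```ruby
require 'json'

# Hypothetical helper: groups bulk lines (one action line followed by one
# source line) into batches holding at most batch_size documents each.
def bulk_batches(lines, batch_size)
  lines.map(&:strip)
       .reject(&:empty?)
       .map { |line| JSON.parse(line) }
       .each_slice(2)           # [action, source] pairs
       .each_slice(batch_size)  # groups of pairs
       .map { |pairs| pairs.flatten(1) }
end

# Shortened sample lines in the same layout as accounts.json.
sample = [
  '{"index":{"_id":"1"}}',
  '{"account_number":1,"balance":39225}',
  '{"index":{"_id":"6"}}',
  '{"account_number":6,"balance":5686}',
  '{"index":{"_id":"13"}}',
  '{"account_number":13,"balance":32838}'
]

batches = bulk_batches(sample, 2)
batches.each_with_index do |batch, i|
  puts "batch #{i}: #{batch.size / 2} documents"
end
```

Each batch keeps the action line directly in front of its source line, which is the order the bulk format requires.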

Checking the index

The bank index now holds more than 10,000 documents, so it is suitable for testing queries beyond the limit.

[looking@master ruby_learning]$ curl localhost:9200/_cat/indices?v
health status index    uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   megacorp Ed5gg9hoRM24dE3AAD1DEQ   1   1          3            0     11.5kb         11.5kb
yellow open   website  svdjlzy2TfCQFLHcKWhjRw   1   1          2            0      8.9kb          8.9kb
yellow open   bank     cSxDPBsyRVyjsHGe9b2RZA   1   1      13022        16978      5.6mb          5.6mb
yellow open   blogs    qfnun_91RI2O1lgTjnBmCQ   3   1          0            0       849b           849b
```
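A side note on the bank row: docs.count (13022) plus docs.deleted (16978) is exactly 30000, which would match 30 passes over 1,000 source documents (assuming accounts.json holds 1,000 documents). A likely explanation is that multiplying IDs produces duplicates, and a duplicate _id overwrites the earlier document. The sketch below shows the effect on a tiny ID set, plus one possible alternative; offset_ids and its stride are my own illustration, not part of the original script.

```ruby
# IDs produced by multiplying every base ID by every factor,
# which is what the generator script in this article does.
def multiplied_ids(base_ids, factors)
  base_ids.flat_map { |id| factors.map { |f| id * f } }
end

# Hypothetical alternative: shift each copy by a stride that is larger
# than any base ID, so copies can never overlap.
def offset_ids(base_ids, copies, stride)
  (0...copies).flat_map { |c| base_ids.map { |id| id + c * stride } }
end

base = [1, 2, 3]

products = multiplied_ids(base, 1..3)
puts "multiplied: #{products.size} generated, #{products.uniq.size} unique"

offsets = offset_ids(base, 3, 1000)
puts "offset:     #{offsets.size} generated, #{offsets.uniq.size} unique"
```

For the purpose of this article the collisions do no harm, since 13022 documents is still enough to cross the 10,000 limit.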

Querying the data

Start with a simple query script:

# Ruby
require 'elasticsearch'
require 'json'
host =  '127.0.0.1'
port = 9200
client = Elasticsearch::Client.new url: "http://#{host}:#{port}"
size = 10
query = {
    query: {
        'match_all': {}
    },
    size: size
}
result = client.search index: 'bank', body: query
puts JSON.pretty_generate(result)

It returns 10 documents, and the output is nicely formatted:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "bank",
        "_type": "account",
        "_id": "1",
        "_score": 1.0,
        "_source": {
         ......
      },
      {
        "_index": "bank",
        "_type": "account",
        "_id": "347",
        "_score": 1.0,
        "_source": {
          "account_number": 347,
          "balance": 36038,
          "firstname": "Gould",
          "lastname": "Carson",
          "age": 24,
          "gender": "F",
          "address": "784 Pulaski Street",
          "employer": "Mobildata",
          "email": "gouldcarson@mobildata.com",
          "city": "Goochland",
          "state": "MI"
        }
      }
    ]
  }
}

With size = 10000, the script still runs normally.

With size = 10001, it fails:

[looking@master ruby_learning]$ ruby test2.rb
Traceback (most recent call last):
	5: from test2.rb:14:in `<main>'
	4: from /usr/local/ruby-2.7.1/lib/ruby/gems/2.7.0/gems/elasticsearch-api-7.9.0/lib/elasticsearch/api/actions/search.rb:103:in `search'
	3: from /usr/local/ruby-2.7.1/lib/ruby/gems/2.7.0/gems/elasticsearch-transport-7.9.0/lib/elasticsearch/transport/client.rb:176:in `perform_request'
	2: from /usr/local/ruby-2.7.1/lib/ruby/gems/2.7.0/gems/elasticsearch-transport-7.9.0/lib/elasticsearch/transport/transport/http/faraday.rb:37:in `perform_request'
	1: from /usr/local/ruby-2.7.1/lib/ruby/gems/2.7.0/gems/elasticsearch-transport-7.9.0/lib/elasticsearch/transport/transport/base.rb:347:in `perform_request'
/usr/local/ruby-2.7.1/lib/ruby/gems/2.7.0/gems/elasticsearch-transport-7.9.0/lib/elasticsearch/transport/transport/base.rb:218:in `__raise_transport_error': [400] {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Result window is too large, from + size must be less than or equal to: [10000] but was [10001]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"bank","node":"hmeiFSEDRZK4hY0jQ1eV7Q","reason":{"type":"illegal_argument_exception","reason":"Result window is too large, from + size must be less than or equal to: [10000] but was [10001]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."}}],"caused_by":{"type":"illegal_argument_exception","reason":"Result window is too large, from + size must be less than or equal to: [10000] but was [10001]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.","caused_by":{"type":"illegal_argument_exception","reason":"Result window is too large, from + size must be less than or equal to: [10000] but was [10001]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."}}},"status":400} (Elasticsearch::Transport::Transport::Errors::BadRequest)
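The rule in the error message is simple arithmetic: from + size must not exceed index.max_result_window, which defaults to 10000. A minimal sketch of checking this before sending a request could look like the following; the helper name and the constant are mine, and the check only mirrors the rule quoted in the message, it does not read the real setting from the cluster.

```ruby
# Default value of index.max_result_window, as quoted in the error message.
DEFAULT_MAX_RESULT_WINDOW = 10_000

# Hypothetical helper: true when the requested page fits inside the window.
def within_result_window?(from, size, max_window = DEFAULT_MAX_RESULT_WINDOW)
  from + size <= max_window
end

[[0, 10_000], [0, 10_001], [9_990, 10], [9_990, 11]].each do |from, size|
  status = within_result_window?(from, size) ? 'ok' : 'too large'
  puts "from=#{from} size=#{size}: #{status}"
end
```

Note that the limit applies to the sum, so a small size with a large from fails in the same way as a large size on its own.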

Changing the setting

Here I change the settings of the bank index and set a new value for max_result_window.

[looking@master ruby_learning]$ curl  -XPUT "localhost:9200/bank/_settings?pretty" -H "Content-Type: application/json" -d '
> {
>     "index" : { "max_result_window" : 100000000}
> }
> '
{
  "acknowledged" : true
}

Querying again

Running the script again no longer raises an error, but hits.total.value still reports only 10000 (with relation gte):

{
  "took": 993,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "bank",
        "_type": "account",
        "_id": "1",
        "_score": 1.0,
        "_source": {
         ...

track_total_hits

Add track_total_hits: true to the query body.

require 'elasticsearch'
require 'json'
host =  '127.0.0.1'
port = 9200
client = Elasticsearch::Client.new url: "http://#{host}:#{port}"
size = 10001
query = {
    track_total_hits: true,
    query: {
        'match_all': {}
    },
    size: size,
}
result = client.search index: 'bank', body: query
puts JSON.pretty_generate(result)
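Since the script keeps changing only size and track_total_hits, the body can be built by a small function. This is just a sketch with a name of my own choosing (match_all_body). Besides true, track_total_hits also accepts an integer, in which case the count is tracked accurately only up to that number; the value 20_000 below is an arbitrary example.

```ruby
require 'json'

# Hypothetical helper: builds the match_all body used in this article.
# track_total_hits may be true, false, or an integer threshold.
def match_all_body(size:, track_total_hits: true)
  {
    track_total_hits: track_total_hits,
    query: { match_all: {} },
    size: size
  }
end

# Exact total, as in the script above.
puts JSON.pretty_generate(match_all_body(size: 10_001))

# Accurate counting up to 20,000 hits only (arbitrary example threshold).
puts JSON.pretty_generate(match_all_body(size: 10, track_total_hits: 20_000))
```

The resulting hash can be passed as body: to client.search in the same way as the query variable in the script.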

Querying again

This time, the document count in the index statistics:

[looking@master ruby_learning]$ curl -X GET "localhost:9200/_cat/indices?"v
health status index    uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   bank     cSxDPBsyRVyjsHGe9b2RZA   1   1      13022        16978      5.6mb          5.6mb
yellow open   website  svdjlzy2TfCQFLHcKWhjRw   1   1          2            0      8.9kb          8.9kb
yellow open   megacorp Ed5gg9hoRM24dE3AAD1DEQ   1   1          3            0     11.5kb         11.5kb
yellow open   blogs    qfnun_91RI2O1lgTjnBmCQ   3   1          0            0       849b           849b

and the total in the response below

# result.json
{
  "took": 215,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 13022,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "bank",
        "_type": "account",
        "_id": "1",
        "_score": 1.0,
        "_source": {
          ...

agree (hits.total.value is the total hit count; because hits.total.relation is eq, the value is exact):

"total": {
      "value": 13022,
      "relation": "eq"
    }

The number of documents actually returned is still the value of size (when size < total):

# Ruby
# test.rb
require 'json'
aa = JSON.load(File.open('result.json'))
puts aa['hits']['hits'].size
------------------------------------------------------------
[looking@master ruby_learning]$ ruby test.rb
10001

If you set size larger than the total number of documents, the query returns every matching document (for example, with size = 20000):

# Ruby
# test.rb
require 'json'
aa = JSON.load(File.open('result.json'))
puts aa['hits']['hits'].size
------------------------------------------------------------
[looking@master ruby_learning]$ ruby test.rb
13022
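The error message quoted earlier also recommends the scroll API for large result sets, as an alternative to raising max_result_window. I did not use it in this article, so the following is only a sketch of how the loop might look. scroll_all is a hypothetical helper, and FakeClient is a stand-in that imitates search, scroll and clear_scroll so the loop can run without a cluster; with the real gem, an Elasticsearch::Client would be passed instead, on the assumption that its calls accept these arguments.

```ruby
# Collects every hit by following the scroll cursor until a page is empty.
# `client` must respond to search, scroll and clear_scroll (assumed to
# behave like the elasticsearch gem's client; not verified against a cluster).
def scroll_all(client, index:, body:, keep_alive: '1m')
  response = client.search(index: index, scroll: keep_alive, body: body)
  scroll_id = response['_scroll_id']
  hits = []
  until response['hits']['hits'].empty?
    hits.concat(response['hits']['hits'])
    response = client.scroll(scroll: keep_alive, body: { scroll_id: scroll_id })
    scroll_id = response['_scroll_id']
  end
  hits
ensure
  # Release the server-side cursor if one was opened.
  client.clear_scroll(body: { scroll_id: scroll_id }) if scroll_id
end

# Stand-in client for illustration only: serves pre-built pages of hits.
class FakeClient
  attr_reader :cleared

  def initialize(pages)
    @pages = pages.dup
    @cleared = false
  end

  def search(index:, scroll:, body:)
    next_page
  end

  def scroll(scroll:, body:)
    next_page
  end

  def clear_scroll(body:)
    @cleared = true
  end

  private

  def next_page
    { '_scroll_id' => 'fake', 'hits' => { 'hits' => @pages.shift || [] } }
  end
end

client = FakeClient.new([[{ '_id' => '1' }, { '_id' => '2' }], [{ '_id' => '3' }]])
all = scroll_all(client, index: 'bank', body: { query: { match_all: {} }, size: 2 })
puts all.size
```

With this approach size only controls the page size, so it can stay well below 10000 no matter how many documents match.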

Reposted from: http://zjjqi.baihongyu.com/