Querying more than 10,000 documents with Elasticsearch in Ruby
Views: 4229
Published: 2019-05-26


Meaning of the result fields

A typical search response looks like the following (I converted it to JSON):

{
  "took": 993,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "bank",
        "_type": "account",
        "_id": "1",
        "_score": 1.0,
        "_source": {
         ...

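The response contains several fields. Their meanings are:

  • took is the time Elasticsearch spent executing the search, in milliseconds.
  • timed_out indicates whether the search timed out.
  • _shards reports how many shards were searched, with counts of the shards that succeeded and failed.
  • hits holds the actual search results.
  • hits.total is an object describing the total number of documents that match the search criteria.
  • hits.total.value is the total hit count (it must be interpreted together with hits.total.relation).
  • By default hits.total.value is not necessarily exact. When hits.total.relation is eq, the value is the exact count. When hits.total.relation is gte, the value is only a lower bound (the real number of matching documents may be larger).
  • hits.hits is the array of returned documents (the first 10 by default).
  • hits.hits.sort is the sort key of each result (present only when the request specifies a sort; otherwise results are ordered by score).
  • hits.hits._index is the name of the index.
  • hits.hits._type is the mapping type of the document.
  • hits.hits._id is the document ID.
  • hits.hits._score is how well the document matches the query.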
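As a rough illustration of how hits.total.value and hits.total.relation are read together, here is a minimal sketch. The helper name describe_total and the two hand-written sample hashes are assumptions made for illustration; they are not part of the Elasticsearch client.

```ruby
require 'json'

# Hypothetical helper: turns the hits.total object into a readable sentence.
def describe_total(response)
  total = response.fetch('hits').fetch('total')
  value = total.fetch('value')
  case total.fetch('relation')
  when 'eq'  then "exactly #{value} matching documents"
  when 'gte' then "at least #{value} matching documents"
  else "#{value} matching documents (unknown relation)"
  end
end

# Hand-written sample responses, trimmed to the fields used here.
capped = JSON.parse('{"hits":{"total":{"value":10000,"relation":"gte"}}}')
exact  = JSON.parse('{"hits":{"total":{"value":13022,"relation":"eq"}}}')

puts describe_total(capped)
puts describe_total(exact)
```

The same helper can be applied to the hash returned by client.search, since the client returns the response with string keys in the same shape.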

Raw data

The official sample data can be downloaded from GitHub. The structure of accounts.json looks like this:

{"index":{"_id":"1"}}
{"account_number":1,"balance":39225,"firstname":"Amber","lastname":"Duke","age":32,"gender":"M","address":"880 Holmes Lane","employer":"Pyrami","email":"amberduke@pyrami.com","city":"Brogan","state":"IL"}
{"index":{"_id":"6"}}
{"account_number":6,"balance":5686,"firstname":"Hattie","lastname":"Bond","age":36,"gender":"M","address":"671 Bristol Street","employer":"Netagy","email":"hattiebond@netagy.com","city":"Dante","state":"TN"}
{"index":{"_id":"13"}}
{"account_number":13,"balance":32838,"firstname":"Nanette","lastname":"Bates","age":28,"gender":"F","address":"789 Madison Street","employer":"Quility","email":"nanettebates@quility.com","city":"Nogal","state":"VA"}
{"index":{"_id":"18"}}
{"account_number":18,"balance":4180,"firstname":"Dale","lastname":"Adams","age":33,"gender":"M","address":"467 Hutchinson Court","employer":"Boink","email":"daleadams@boink.com","city":"Orick","state":"MD"}
...

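Generating data

This data set has nowhere near 10,000 documents, so it cannot be used to test going over the limit. We can write a script to expand accounts.json (it is rough, please bear with it). Here I simply create new records by changing the IDs of the original ones; you can do it your own way, as long as you end up with enough data. The generated data is stored in test.json: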

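# Ruby
require 'json'

# Read the original bulk file (alternating action lines and document lines).
lines = File.readlines("accounts.json")

# "a+" appends, so delete test.json before re-running the script.
File.open("test.json", "a+") do |out|
  (1..30).each do |i|
    lines.each do |raw|
      line = JSON.parse(raw)
      # Derive new IDs by multiplying the originals by i. Different products
      # can coincide (e.g. 2 * 3 == 3 * 2), and a repeated _id overwrites
      # the earlier document on import.
      if line.key? "index"
        line["index"]["_id"] = (line["index"]["_id"].to_i * i).to_s
      end
      if line.key? "account_number"
        line["account_number"] *= i
      end
      out.puts line.to_json
    end
  end
end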
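Multiplying IDs means that different products can coincide (for example 2 × 3 and 3 × 2), so some generated lines reuse an _id and replace an earlier document during the bulk import. If unique IDs are preferred, one option is to shift the IDs by a fixed stride per copy instead. This is only a sketch and not the script used for the results in this post; the method name expand_bulk_lines and the stride parameter are hypothetical, and stride is assumed to be larger than the biggest ID in the source file.

```ruby
require 'json'

# Hypothetical helper: returns `copies` copies of the bulk lines, shifting
# every ID by (copy index * stride) so that no two copies share an ID.
# Assumption: stride is larger than the largest ID in `lines`.
def expand_bulk_lines(lines, copies:, stride:)
  (0...copies).flat_map do |copy|
    offset = copy * stride
    lines.map do |raw|
      doc = JSON.parse(raw)
      doc['index']['_id'] = (doc['index']['_id'].to_i + offset).to_s if doc.key?('index')
      doc['account_number'] += offset if doc.key?('account_number')
      doc.to_json
    end
  end
end

# Four sample lines in the same shape as accounts.json.
sample = [
  '{"index":{"_id":"1"}}',
  '{"account_number":1,"balance":39225}',
  '{"index":{"_id":"6"}}',
  '{"account_number":6,"balance":5686}'
]

expanded = expand_bulk_lines(sample, copies: 3, stride: 1000)
ids = expanded.map { |l| JSON.parse(l) }
              .select { |d| d.key?('index') }
              .map { |d| d['index']['_id'] }
puts ids.inspect
```

With unique IDs, every generated document survives the import, so the document count of the index would equal the number of document lines written.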

Importing the data

Next, bulk-import the generated data into Elasticsearch with the following command:

[looking@master ruby_learning]$ curl -XPOST 'localhost:9200/bank/account/_bulk?pretty' -H 'Content-Type: application/json' --data-binary "@test.json"

Checking the index

We can see that the bank index holds more than 10,000 documents, so it is suitable for testing queries that go over the limit.

[looking@master ruby_learning]$ curl localhost:9200/_cat/indices?v
health status index    uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   megacorp Ed5gg9hoRM24dE3AAD1DEQ   1   1          3            0     11.5kb         11.5kb
yellow open   website  svdjlzy2TfCQFLHcKWhjRw   1   1          2            0      8.9kb          8.9kb
yellow open   bank     cSxDPBsyRVyjsHGe9b2RZA   1   1      13022        16978      5.6mb          5.6mb
yellow open   blogs    qfnun_91RI2O1lgTjnBmCQ   3   1          0            0       849b           849b

Querying the data

Start with a simple query script:

# Ruby
require 'elasticsearch'
require 'json'
host =  '127.0.0.1'
port = 9200
client = Elasticsearch::Client.new url: "http://#{host}:#{port}"
size = 10
query = {
    query: {
        'match_all': {}
    },
    size: size
}
result = client.search index: 'bank', body: query
puts JSON.pretty_generate(result)

It returns 10 documents, and the output is nicely formatted:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "bank",
        "_type": "account",
        "_id": "1",
        "_score": 1.0,
        "_source": {
         ......
      },
      {
        "_index": "bank",
        "_type": "account",
        "_id": "347",
        "_score": 1.0,
        "_source": {
          "account_number": 347,
          "balance": 36038,
          "firstname": "Gould",
          "lastname": "Carson",
          "age": 24,
          "gender": "F",
          "address": "784 Pulaski Street",
          "employer": "Mobildata",
          "email": "gouldcarson@mobildata.com",
          "city": "Goochland",
          "state": "MI"
        }
      }
    ]
  }
}

With size = 10000, the script still runs normally.

With size = 10001, it fails:

[looking@master ruby_learning]$ ruby test2.rb
Traceback (most recent call last):
	5: from test2.rb:14:in `<main>'
	4: from /usr/local/ruby-2.7.1/lib/ruby/gems/2.7.0/gems/elasticsearch-api-7.9.0/lib/elasticsearch/api/actions/search.rb:103:in `search'
	3: from /usr/local/ruby-2.7.1/lib/ruby/gems/2.7.0/gems/elasticsearch-transport-7.9.0/lib/elasticsearch/transport/client.rb:176:in `perform_request'
	2: from /usr/local/ruby-2.7.1/lib/ruby/gems/2.7.0/gems/elasticsearch-transport-7.9.0/lib/elasticsearch/transport/transport/http/faraday.rb:37:in `perform_request'
	1: from /usr/local/ruby-2.7.1/lib/ruby/gems/2.7.0/gems/elasticsearch-transport-7.9.0/lib/elasticsearch/transport/transport/base.rb:347:in `perform_request'
/usr/local/ruby-2.7.1/lib/ruby/gems/2.7.0/gems/elasticsearch-transport-7.9.0/lib/elasticsearch/transport/transport/base.rb:218:in `__raise_transport_error': [400] {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Result window is too large, from + size must be less than or equal to: [10000] but was [10001]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"bank","node":"hmeiFSEDRZK4hY0jQ1eV7Q","reason":{"type":"illegal_argument_exception","reason":"Result window is too large, from + size must be less than or equal to: [10000] but was [10001]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."}}],"caused_by":{"type":"illegal_argument_exception","reason":"Result window is too large, from + size must be less than or equal to: [10000] but was [10001]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.","caused_by":{"type":"illegal_argument_exception","reason":"Result window is too large, from + size must be less than or equal to: [10000] but was [10001]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."}}},"status":400} (Elasticsearch::Transport::Transport::Errors::BadRequest)
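The rule quoted in the error message is simple arithmetic: a request is rejected when from + size is greater than index.max_result_window (10000 here). A minimal sketch of that check follows; the helper name within_result_window? and the constant are made up for illustration and are not part of the Elasticsearch client.

```ruby
# Limit quoted in the error message above for index.max_result_window.
DEFAULT_MAX_RESULT_WINDOW = 10_000

# Hypothetical helper mirroring the check described in the error message:
# a request is accepted only if from + size does not exceed the window.
def within_result_window?(from, size, max_window = DEFAULT_MAX_RESULT_WINDOW)
  from + size <= max_window
end

puts within_result_window?(0, 10_000)   # true
puts within_result_window?(0, 10_001)   # false
puts within_result_window?(9_995, 10)   # false, because 9995 + 10 = 10005
```

Note that the limit applies to the sum, so a small size can still be rejected when from is large.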

Changing the setting

Here I update the settings of the bank index and set a new value for max_result_window.

[looking@master ruby_learning]$ curl  -XPUT "localhost:9200/bank/_settings?pretty" -H "Content-Type: application/json" -d '
> {
>     "index" : { "max_result_window" : 100000000}
> }
> '
{
  "acknowledged" : true
}
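The same change can probably be made from Ruby instead of curl. The sketch below assumes the client object exposes indices.put_settings(index:, body:), as the elasticsearch gem's client does; the helper names max_result_window_settings and raise_result_window are hypothetical, and the call against a real cluster is left as a comment rather than executed.

```ruby
# Builds the same settings body that the curl command above sends.
def max_result_window_settings(window)
  { index: { max_result_window: window } }
end

# Hypothetical helper: applies the setting through any client object that
# responds to indices.put_settings(index:, body:).
def raise_result_window(client, index, window)
  client.indices.put_settings(index: index, body: max_result_window_settings(window))
end

# Usage against a real cluster (not run here):
#   require 'elasticsearch'
#   client = Elasticsearch::Client.new url: 'http://127.0.0.1:9200'
#   raise_result_window(client, 'bank', 100_000_000)
puts max_result_window_settings(100_000_000).inspect
```

A very large window lets deep from + size requests through, but each such request still has to collect and sort that many hits, which is why the error message points to the scroll API for large result sets.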

Querying again

Running the query script again no longer raises an error, but hits.total.value is still reported as 10000:

{
  "took": 993,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "bank",
        "_type": "account",
        "_id": "1",
        "_score": 1.0,
        "_source": {
         ...

track_total_hits

Add track_total_hits: true to the query body.

require 'elasticsearch'
require 'json'
host =  '127.0.0.1'
port = 9200
client = Elasticsearch::Client.new url: "http://#{host}:#{port}"
size = 10001
query = {
    track_total_hits: true,
    query: {
        'match_all': {}
    },
    size: size,
}
result = client.search index: 'bank', body: query
puts JSON.pretty_generate(result)
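Besides true, track_total_hits also accepts false (skip counting) or an integer (count exactly up to that number). A small sketch of a body builder that makes the option explicit is shown below; the helper name build_search_body is hypothetical, and the resulting hash is what would be passed as body: to client.search.

```ruby
require 'json'

# Hypothetical helper: builds a match_all search body.
# track_total_hits may be true (count exactly), false (skip counting),
# or an Integer (count exactly up to that number).
def build_search_body(size:, track_total_hits: true)
  {
    track_total_hits: track_total_hits,
    query: { match_all: {} },
    size: size
  }
end

puts JSON.pretty_generate(build_search_body(size: 10_001))
puts JSON.generate(build_search_body(size: 10, track_total_hits: 50_000))
```

Exact counting requires visiting every matching document, so an integer threshold can be a reasonable middle ground when only "more than N" matters.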

Querying again

This time, compare the document count in the index statistics:

[looking@master ruby_learning]$ curl -X GET "localhost:9200/_cat/indices?v"
health status index    uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   bank     cSxDPBsyRVyjsHGe9b2RZA   1   1      13022        16978      5.6mb          5.6mb
yellow open   website  svdjlzy2TfCQFLHcKWhjRw   1   1          2            0      8.9kb          8.9kb
yellow open   megacorp Ed5gg9hoRM24dE3AAD1DEQ   1   1          3            0     11.5kb         11.5kb
yellow open   blogs    qfnun_91RI2O1lgTjnBmCQ   3   1          0            0       849b           849b

with the total reported in the query result:

# result.json
{
  "took": 215,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 13022,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "bank",
        "_type": "account",
        "_id": "1",
        "_score": 1.0,
        "_source": {
          ...

The two numbers agree (hits.total.value is the total hit count, and since hits.total.relation is eq, the value is an exact count):

"total": {
      "value": 13022,
      "relation": "eq"
    }

The number of documents actually returned is still the number set by size (when size < total):

# Ruby
# test.rb
require 'json'
aa = JSON.load(File.open('result.json'))
puts aa['hits']['hits'].size
------------------------------------------------------------
[looking@master ruby_learning]$ ruby test.rb
10001

If you set size larger than the total number of documents, the whole result set is returned (for example, with size = 20000):

# Ruby
# test.rb
require 'json'
aa = JSON.load(File.open('result.json'))
puts aa['hits']['hits'].size
------------------------------------------------------------
[looking@master ruby_learning]$ ruby test.rb
13022
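The two runs above follow one rule: a single search returns the smaller of size and the number of matching documents. A tiny sketch of that rule is given below, assuming from is 0 and max_result_window is large enough for the request to be accepted; the helper name returned_hit_count is made up for illustration.

```ruby
# Hypothetical helper: number of documents a single search returns,
# given the requested size and the total number of matches.
# Assumes from = 0 and a max_result_window that allows the request.
def returned_hit_count(size, total_matches)
  [size, total_matches].min
end

puts returned_hit_count(10_001, 13_022)  # 10001
puts returned_hit_count(20_000, 13_022)  # 13022
```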

 

Reposted from: http://zjjqi.baihongyu.com/
