A partially observable Markov decision process (POMDP) is a generalization of a Markov decision process (MDP). A POMDP models an agent decision process in which it is assumed that the system dynamics are determined by an MDP, but the agent cannot directly observe the underlying state. Instead, it must maintain a sensor model (the probability distribution of different observations given the underlying state) and the underlying MDP. Unlike the policy function of an MDP, which maps the underlying states to actions, a POMDP's policy is a mapping from the history of observations (or belief states) to actions.

The POMDP framework is general enough to model a variety of real-world sequential decision processes. Applications include robot navigation problems, machine maintenance, and planning under uncertainty in general. The general framework of Markov decision processes with imperfect information was described by Karl Johan Åström in 1965[1] in the case of a discrete state space, and it was further studied in the operations research community where the acronym POMDP was coined. It was later adapted for problems in artificial intelligence and automated planning by Leslie P. Kaelbling and Michael L. Littman.[2]

An exact solution to a POMDP yields the optimal action for each possible belief over the world states. The optimal action maximizes the expected reward (or minimizes the cost) of the agent over a possibly infinite horizon. The sequence of optimal actions is known as the optimal policy of the agent for interacting with its environment.

Definition

Formal definition

A discrete-time POMDP models the relationship between an agent and its environment. Formally, a POMDP is a 7-tuple $(S, A, T, R, \Omega, O, \gamma)$, where

  • $S$ is a set of states,
  • $A$ is a set of actions,
  • $T$ is a set of conditional transition probabilities $T(s' \mid s, a)$ between states,
  • $R : S \times A \to \mathbb{R}$ is the reward function,
  • $\Omega$ is a set of observations,
  • $O$ is a set of conditional observation probabilities $O(o \mid s', a)$, and
  • $\gamma \in [0, 1)$ is the discount factor.

At each time period, the environment is in some state $s \in S$. The agent takes an action $a \in A$, which causes the environment to transition to state $s'$ with probability $T(s' \mid s, a)$. At the same time, the agent receives an observation $o \in \Omega$ which depends on the new state of the environment, $s'$, and on the just-taken action, $a$, with probability $O(o \mid s', a)$ (or sometimes $O(o \mid s')$, depending on the sensor model). Finally, the agent receives a reward $r$ equal to $R(s, a)$. Then the process repeats. The goal is for the agent to choose actions at each time step that maximize its expected future discounted reward: $E\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right]$, where $r_{t}$ is the reward earned at time $t$. The discount factor $\gamma$ determines how much immediate rewards are favored over more distant rewards. When $\gamma = 0$ the agent only cares about which action will yield the largest expected immediate reward; as $\gamma$ approaches 1 the agent cares about maximizing the expected sum of future rewards.
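
To make these dynamics concrete, the following is a minimal Python sketch of a finite POMDP stored as transition, observation, and reward tables, together with a single simulated time step. The two-state toy model, the probabilities, and all variable names are illustrative assumptions introduced here (later sketches in this article reuse these toy tables); they are not part of the formal definition.

```python
import random

# A minimal finite POMDP: two hidden states, two actions, two observations.
# All names and numbers are illustrative assumptions, not part of the definition.
states = ["good", "bad"]
actions = ["wait", "act"]
observations = ["quiet", "noisy"]

# T[s][a][s_next] = probability of moving to s_next after taking a in s
T = {
    "good": {"wait": {"good": 0.9, "bad": 0.1}, "act": {"good": 0.5, "bad": 0.5}},
    "bad":  {"wait": {"good": 0.1, "bad": 0.9}, "act": {"good": 0.5, "bad": 0.5}},
}
# O[s_next][a][o] = probability of observing o after landing in s_next via action a
O = {
    "good": {"wait": {"quiet": 0.8, "noisy": 0.2}, "act": {"quiet": 0.8, "noisy": 0.2}},
    "bad":  {"wait": {"quiet": 0.3, "noisy": 0.7}, "act": {"quiet": 0.3, "noisy": 0.7}},
}
# R[s][a] = immediate reward for taking a in state s
R = {"good": {"wait": 1.0, "act": 0.0}, "bad": {"wait": -1.0, "act": 0.5}}
gamma = 0.95  # discount factor

def sample(dist):
    """Draw one outcome from a {outcome: probability} dictionary."""
    r, acc = random.random(), 0.0
    for outcome, p in dist.items():
        acc += p
        if r <= acc:
            return outcome
    return outcome  # guard against floating-point rounding

def step(s, a):
    """One time period: transition, observe, and collect the reward R(s, a)."""
    s_next = sample(T[s][a])
    o = sample(O[s_next][a])
    return s_next, o, R[s][a]
```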

Discussion

Because the agent does not directly observe the environment's state, it must make decisions under uncertainty about the true environment state. However, by interacting with the environment and receiving observations, the agent can update its belief about the true state by updating the probability distribution over the current state. A consequence of this property is that the optimal behavior may often include (information gathering) actions that are taken purely because they improve the agent's estimate of the current state, thereby allowing it to make better decisions in the future.

It is instructive to compare the above definition with the definition of a Markov decision process. An MDP does not include the observation set, because the agent always knows with certainty the environment's current state. Alternatively, an MDP can be reformulated as a POMDP by setting the observation set to be equal to the set of states and defining the observation conditional probabilities to deterministically select the observation that corresponds to the true state.
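
As a small illustration of this reformulation, the sketch below builds a deterministic observation model for a toy fully observable MDP; the state and action names are illustrative assumptions, not part of the definition above.

```python
# Recast a fully observable MDP as a POMDP: the observation set equals the
# state set, and the state just reached is observed with probability 1.
mdp_states = ["s0", "s1"]            # illustrative MDP state set
mdp_actions = ["left", "right"]      # illustrative MDP action set
mdp_observations = list(mdp_states)  # Omega = S

# O[s_next][a][o] = 1 exactly when the observation names the true successor state
O_deterministic = {
    s_next: {a: {o: 1.0 if o == s_next else 0.0 for o in mdp_observations}
             for a in mdp_actions}
    for s_next in mdp_states
}
```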

Belief update

After having taken the action $a$ and observed $o$, an agent needs to update its belief in the state the environment may (or may not) be in. Since the state is Markovian (by assumption), maintaining a belief over the states solely requires knowledge of the previous belief state, the action taken, and the current observation. The operation is denoted $b' = \tau(b, a, o)$. Below we describe how this belief update is computed.

After reaching $s'$, the agent observes $o \in \Omega$ with probability $O(o \mid s', a)$. Let $b$ be a probability distribution over the state space $S$: $b(s)$ denotes the probability that the environment is in state $s$. Given $b(s)$, then after taking action $a$ and observing $o$,

$b'(s') = \eta\, O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s),$

where $\eta = 1/\Pr(o \mid b, a)$ is a normalizing constant with $\Pr(o \mid b, a) = \sum_{s' \in S} O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s)$.
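
As a concrete illustration, the sketch below implements the update $b' = \tau(b, a, o)$ for the toy tables introduced in the earlier sketch (states, T, and O are the illustrative assumptions from that sketch, not part of the article's definitions).

```python
def belief_update(b, a, o):
    """Return b' = tau(b, a, o) for the toy POMDP defined above.

    b is a {state: probability} dictionary; the result is the posterior over
    successor states after taking action a and observing o.
    """
    unnormalized = {
        s_next: O[s_next][a][o] * sum(T[s][a][s_next] * b[s] for s in states)
        for s_next in states
    }
    prob_o = sum(unnormalized.values())  # Pr(o | b, a); eta is its reciprocal
    if prob_o == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s_next: p / prob_o for s_next, p in unnormalized.items()}
```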

Belief MDP

A Markovian belief state allows a POMDP to be formulated as a Markov decision process where every belief is a state. The resulting belief MDP will thus be defined on a continuous state space (even if the "originating" POMDP has a finite number of states: there are infinitely many belief states (in $B$) because there are infinitely many probability distributions over the states (of $S$)).[2]

Formally, the belief MDP is defined as a tuple $(B, A, \tau, r, \gamma)$, where

  • $B$ is the set of belief states over the POMDP states,
  • $A$ is the same finite set of actions as for the original POMDP,
  • $\tau$ is the belief state transition function,
  • $r : B \times A \to \mathbb{R}$ is the reward function on belief states, and
  • $\gamma$ is the discount factor, equal to the $\gamma$ in the original POMDP.

Of these, $\tau$ and $r$ need to be derived from the original POMDP. $\tau$ is

$\tau(b, a, b') = \sum_{o \in \Omega} \Pr(b' \mid b, a, o)\, \Pr(o \mid b, a),$

where $\Pr(o \mid b, a)$ is the value derived in the previous section and

$\Pr(b' \mid b, a, o) = \begin{cases} 1 & \text{if the belief update with arguments } b, a, o \text{ returns } b' \\ 0 & \text{otherwise.} \end{cases}$

The belief MDP reward function ($r$) is the expected reward from the POMDP reward function over the belief state distribution:

$r(b, a) = \sum_{s \in S} b(s)\, R(s, a).$

The belief MDP is not partially observable anymore, since at any given time the agent knows its belief, and by extension the state of the belief MDP.
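
Continuing the toy example from the earlier sketches, the belief-MDP reward $r(b, a)$ and the observation probability $\Pr(o \mid b, a)$ used by the transition function can be written as follows; the function names are illustrative assumptions.

```python
def belief_reward(b, a):
    """r(b, a): expected immediate POMDP reward under belief b (toy tables above)."""
    return sum(b[s] * R[s][a] for s in states)

def observation_prob(b, a, o):
    """Pr(o | b, a): probability of observing o after taking a from belief b."""
    return sum(
        O[s_next][a][o] * sum(T[s][a][s_next] * b[s] for s in states)
        for s_next in states
    )
```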

Policy and value function

Unlike the "originating" POMDP (where each action is available from only one state), in the corresponding Belief MDP all belief states allow all actions, since you (almost) always have some probability of believing you are in any (originating) state. As such, specifies an action for any belief .

Here it is assumed that the objective is to maximize the expected total discounted reward over an infinite horizon. When $R$ defines a cost, the objective becomes the minimization of the expected cost.

The expected reward for policy $\pi$ starting from belief $b_0$ is defined as

$V^{\pi}(b_0) = \sum_{t=0}^{\infty} \gamma^{t}\, r(b_t, a_t) = \sum_{t=0}^{\infty} \gamma^{t}\, E\!\left[ R(s_t, a_t) \mid b_0, \pi \right],$

where $\gamma < 1$ is the discount factor. The optimal policy $\pi^{*}$ is obtained by optimizing the long-term reward:

$\pi^{*} = \underset{\pi}{\operatorname{argmax}}\; V^{\pi}(b_0),$

where $b_0$ is the initial belief.

The optimal policy, denoted by $\pi^{*}$, yields the highest expected reward value for each belief state, compactly represented by the optimal value function $V^{*}$. This value function is the solution to the Bellman optimality equation:

$V^{*}(b) = \max_{a \in A} \left[ r(b, a) + \gamma \sum_{o \in \Omega} \Pr(o \mid b, a)\, V^{*}\!\big(\tau(b, a, o)\big) \right].$

For finite-horizon POMDPs, the optimal value function is piecewise-linear and convex.[3] It can be represented as a finite set of vectors. In the infinite-horizon formulation, a finite vector set can approximate $V^{*}$ arbitrarily closely, and the approximation remains convex. Value iteration applies a dynamic programming update to gradually improve the value until convergence to an $\varepsilon$-optimal value function, and it preserves piecewise linearity and convexity.[4] By improving the value, the policy is implicitly improved. Another dynamic programming technique called policy iteration explicitly represents and improves the policy instead.[5][6]
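
To illustrate this piecewise-linear and convex representation, the sketch below evaluates a value function stored as a finite set of α-vectors, each paired with an action, over the toy state list from the earlier sketches. The two vectors are made-up placeholders; in practice they would be produced by a value-iteration or point-based backup.

```python
# A value function represented by a finite set of alpha-vectors, each labelled
# with the action whose conditional plan it evaluates. The numbers are made-up
# placeholders for illustration only.
alpha_vectors = [
    ({"good": 2.0, "bad": -1.0}, "wait"),
    ({"good": 0.8, "bad": 0.8}, "act"),
]

def value(b):
    """V(b) = max over alpha-vectors of sum_s alpha(s) b(s): piecewise-linear, convex in b."""
    return max(sum(alpha[s] * b[s] for s in states) for alpha, _ in alpha_vectors)

def greedy_action(b):
    """Action attached to the maximizing alpha-vector at belief b."""
    best = max(alpha_vectors, key=lambda av: sum(av[0][s] * b[s] for s in states))
    return best[1]
```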

Approximate POMDP solutions

In practice, POMDPs are often computationally intractable to solve exactly. This intractability is often due to the curse of dimensionality or the curse of history (the fact that optimal policies may depend on the entire history of actions and observations). To address these issues, computer scientists have developed various approximate POMDP solutions.[7] These solutions typically attempt to approximate the problem or solution with a limited number of parameters, plan only over a small part of the belief space online, or summarize the action-observation history compactly.

Grid-based algorithms[8] comprise one approximate solution technique. In this approach, the value function is computed for a set of points in the belief space, and interpolation is used to determine the optimal action for other belief states that are encountered but are not in the set of grid points. More recent work makes use of sampling techniques, generalization techniques, and exploitation of problem structure, and has extended POMDP solving to large domains with millions of states.[9][10] For example, adaptive grids and point-based methods sample random reachable belief points to constrain planning to the relevant areas of the belief space.[11][12] Dimensionality reduction using PCA has also been explored.[13]

Online planning algorithms approach large POMDPs by constructing a new policy for the current belief each time a new observation is received. Such a policy only needs to consider future beliefs reachable from the current belief, which is often only a very small part of the full belief space. This family includes variants of Monte Carlo tree search[14] and heuristic search.[15] Similar to MDPs, it is possible to construct online algorithms that find arbitrarily near-optimal policies and have no direct computational complexity dependence on the size of the state and observation spaces.[16]
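
A brute-force stand-in for such online planners, assuming the toy belief-update and reward functions from the earlier sketches, is a depth-limited expectimax search over the beliefs reachable from the current belief; it is not one of the cited algorithms, only an illustration of replanning from the current belief.

```python
def plan_online(b, depth):
    """Depth-limited expectimax over beliefs reachable from b (toy POMDP above).

    Returns (value, best_action). A brute-force stand-in for the Monte Carlo
    tree search and heuristic search planners cited in the text.
    """
    if depth == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for a in actions:
        q = belief_reward(b, a)
        for o in observations:
            p_o = observation_prob(b, a, o)
            if p_o > 0.0:
                future, _ = plan_online(belief_update(b, a, o), depth - 1)
                q += gamma * p_o * future
        if q > best_value:
            best_value, best_action = q, a
    return best_value, best_action

# Replan from the current belief after every new observation, e.g.:
# _, a = plan_online({"good": 0.5, "bad": 0.5}, depth=3)
```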

Another line of approximate solution techniques for solving POMDPs relies on using (a subset of) the history of previous observations, actions and rewards up to the current time step as a pseudo-state. Usual techniques for solving MDPs based on these pseudo-states can then be used (e.g. Q-learning). Ideally the pseudo-states should contain the most important information from the whole history (to reduce bias) while being as compressed as possible (to reduce overfitting).[17]
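
A minimal sketch of this idea, assuming a fixed-length window of recent action-observation pairs as the pseudo-state and illustrative Q-learning constants (none of which are prescribed by the cited work), is shown below.

```python
import random
from collections import defaultdict, deque

ACTIONS = ["wait", "act"]            # illustrative action set
WINDOW = 3                           # recent (action, observation) pairs kept as the pseudo-state
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1   # illustrative learning rate, discount, exploration rate

Q = defaultdict(float)               # Q[(pseudo_state, action)]
history = deque(maxlen=WINDOW)       # truncated action-observation history

def pseudo_state():
    """Compress the history into a hashable pseudo-state (here: the raw window)."""
    return tuple(history)

def choose_action(ps):
    """Epsilon-greedy action selection over pseudo-states."""
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(ps, a)])

def q_learning_update(ps, a, r, ps_next):
    """Standard tabular Q-learning step applied to pseudo-states."""
    target = r + GAMMA * max(Q[(ps_next, a2)] for a2 in ACTIONS)
    Q[(ps, a)] += ALPHA * (target - Q[(ps, a)])
```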

POMDP theory

Planning in POMDPs is undecidable in general. However, some settings have been identified as decidable (see Table 2 in [18], reproduced below). Different objectives have been considered. Büchi objectives are defined by Büchi automata. Reachability is an example of a Büchi condition (for instance, reaching a good state in which all robots are home). coBüchi objectives correspond to traces that do not satisfy a given Büchi condition (for instance, not reaching a bad state in which some robot has died). Parity objectives are defined via parity games; they make it possible to define complex objectives such as reaching a good state every 10 timesteps. The objective can be satisfied:

  • almost-surely, that is, the probability of satisfying the objective is 1;
  • positively, that is, the probability of satisfying the objective is strictly greater than 0;
  • quantitatively, that is, the probability of satisfying the objective is greater than a given threshold.

Both the finite-memory case, in which the agent is a finite-state machine, and the general case, in which the agent has infinite memory, are considered.

Objective | Almost-sure (infinite memory) | Almost-sure (finite memory) | Positive (infinite memory) | Positive (finite memory) | Quantitative (infinite memory) | Quantitative (finite memory)
Büchi | EXPTIME-complete | EXPTIME-complete | undecidable | EXPTIME-complete[18] | undecidable | undecidable
coBüchi | undecidable | EXPTIME-complete[18] | EXPTIME-complete | EXPTIME-complete | undecidable | undecidable
Parity | undecidable | EXPTIME-complete[18] | undecidable | EXPTIME-complete[18] | undecidable | undecidable

Applications

POMDPs can be used to model many kinds of real-world problems. Notable applications include the use of a POMDP in the management of patients with ischemic heart disease,[19] assistive technology for persons with dementia,[9][10] the conservation of the critically endangered and difficult-to-detect Sumatran tiger,[20] and aircraft collision avoidance.[21]

One application is a teaching case, the "crying baby" problem, in which a parent must sequentially decide whether to feed a baby based on observing whether the baby is crying or not, an imperfect representation of the baby's actual state of hunger.[22][23]

References

  1. ^ Åström, K.J. (1965). "Optimal control of Markov processes with incomplete state information". Journal of Mathematical Analysis and Applications. 10: 174–205. doi:10.1016/0022-247X(65)90154-X.
  2. ^ a b Kaelbling, L.P., Littman, M.L., Cassandra, A.R. (1998). "Planning and acting in partially observable stochastic domains". Artificial Intelligence. 101 (1–2): 99–134. doi:10.1016/S0004-3702(98)00023-X.
  3. ^ Sondik, E.J. (1971). The optimal control of partially observable Markov processes (PhD thesis). Stanford University. Archived from the original on October 17, 2019.
  4. ^ Smallwood, R.D., Sondik, E.J. (1973). "The optimal control of partially observable Markov decision processes over a finite horizon". Operations Research. 21 (5): 1071–88. doi:10.1287/opre.21.5.1071.
  5. ^ Sondik, E.J. (1978). "The optimal control of partially observable Markov processes over the infinite horizon: discounted cost". Operations Research. 26 (2): 282–304. doi:10.1287/opre.26.2.282.
  6. ^ Hansen, E. (1998). "Solving POMDPs by searching in policy space". Proceedings of the Fourteenth International Conference on Uncertainty In Artificial Intelligence (UAI-98). arXiv:1301.7380.
  7. ^ Hauskrecht, M. (2000). "Value function approximations for partially observable Markov decision processes". Journal of Artificial Intelligence Research. 13: 33–94. arXiv:1106.0234. doi:10.1613/jair.678.
  8. ^ Lovejoy, W. (1991). "Computationally feasible bounds for partially observed Markov decision processes". Operations Research. 39: 162–175. doi:10.1287/opre.39.1.162.
  9. ^ a b Jesse Hoey; Axel von Bertoldi; Pascal Poupart; Alex Mihailidis (2007). "Assisting Persons with Dementia during Handwashing Using a Partially Observable Markov Decision Process". Proceedings of the International Conference on Computer Vision Systems. doi:10.2390/biecoll-icvs2007-89.
  10. ^ a b Jesse Hoey; Pascal Poupart; Axel von Bertoldi; Tammy Craig; Craig Boutilier; Alex Mihailidis. (2010). "Automated Handwashing Assistance For Persons With Dementia Using Video and a Partially Observable Markov Decision Process". Computer Vision and Image Understanding. 114 (5): 503–519. CiteSeerX 10.1.1.160.8351. doi:10.1016/j.cviu.2009.06.008.
  11. ^ Pineau, J., Gordon, G., Thrun, S. (August 2003). "Point-based value iteration: An anytime algorithm for POMDPs" (PDF). International Joint Conference on Artificial Intelligence (IJCAI). Acapulco, Mexico. pp. 1025–32.
  12. ^ Hauskrecht, M. (1997). "Incremental methods for computing bounds in partially observable Markov decision processes". Proceedings of the 14th National Conference on Artificial Intelligence (AAAI). Providence, RI. pp. 734–739. CiteSeerX 10.1.1.85.8303.
  13. ^ Roy, Nicholas; Gordon, Geoffrey (2003). "Exponential Family PCA for Belief Compression in POMDPs" (PDF). Advances in Neural Information Processing Systems.
  14. ^ David Silver and Joel Veness (2010). Monte-Carlo planning in large POMDPs. Advances in neural information processing systems.
  15. ^ Nan Ye, Adhiraj Somani, David Hsu, and Wee Sun Lee (2017). "DESPOT: Online POMDP Planning with Regularization". Journal of Artificial Intelligence Research. 58: 231–266. arXiv:1609.03250. doi:10.1613/jair.5328.
  16. ^ Michael H. Lim, Tyler J. Becker, Mykel J. Kochenderfer, Claire J. Tomlin, and Zachary N. Sunberg (2023). "Optimality Guarantees for Particle Belief Approximation of POMDPs". Journal of Artificial Intelligence Research. 77: 1591–1636. arXiv:2210.05015. doi:10.1613/jair.1.14525.
  17. ^ Francois-Lavet, V., Rabusseau, G., Pineau, J., Ernst, D., Fonteneau, R. (2019). "On overfitting and asymptotic bias in batch reinforcement learning with partial observability". Journal of Artificial Intelligence Research. 65: 1–30. arXiv:1709.07796.
  18. ^ a b c d e Chatterjee, Krishnendu; Chmelík, Martin; Tracol, Mathieu (2016). "What is decidable about partially observable Markov decision processes with ω-regular objectives". Journal of Computer and System Sciences. 82 (5): 878–911. doi:10.1016/j.jcss.2016.02.009. ISSN 0022-0000.
  19. ^ Hauskrecht, M., Fraser, H. (2000). "Planning treatment of ischemic heart disease with partially observable Markov decision processes". Artificial Intelligence in Medicine. 18 (3): 221–244. doi:10.1016/S0933-3657(99)00042-1. PMID 10675716.
  20. ^ Chadès, I., McDonald-Madden, E., McCarthy, M.A., Wintle, B., Linkie, M., Possingham, H.P. (16 September 2008). "When to stop managing or surveying cryptic threatened species". Proc. Natl. Acad. Sci. U.S.A. 105 (37): 13936–40. Bibcode:2008PNAS..10513936C. doi:10.1073/pnas.0805265105. PMC 2544557. PMID 18779594.
  21. ^ Kochenderfer, Mykel J. (2015). "Optimized Airborne Collision Avoidance". Decision Making Under Uncertainty. The MIT Press.
  22. ^ Kochenderfer, Mykel J.; Wheeler, Tim A.; Wray, Kyle H. (2022). Algorithms for decision making. Cambridge, Massachusetts; London, England: MIT Press. p. 678. ISBN 9780262047012.
  23. ^ Moss, Robert J. (Sep 24, 2021). "WATCH: POMDPs: Decision Making Under Uncertainty POMDPs.jl. Crying baby problem" (video). youtube.com. The Julia Programming Language.